Wmatrix corpus analysis and comparison tool


Wmatrix is a corpus analysis and comparison software tool. It provides a web interface to the USAS and CLAWS corpus annotation tools, and standard corpus linguistic methodologies such as frequency lists and concordances. It also extends the keywords method to key grammatical categories and key semantic fields.

Wmatrix allows the user to run these tools via a web browser such as Netscape or Internet Explorer, and so will run on any computer (Mac, Windows PC, Linux, Unix) with a web browser and a network connection. Wmatrix was initially developed by Paul Rayson in the REVERE project, was extended and applied to corpus linguistics during his PhD work, and is still being updated regularly. Earlier versions were available for Unix via terminal-based command line access (tmatrix) and for Unix via Xwindows (Xmatrix), but these only offer retrieval of text pre-annotated with USAS and CLAWS.

This introduction to Wmatrix covers screenshots, references for Wmatrix, and example applications and publications.

Mini-tutorial for Wmatrix: with step-by-step instructions on how to compare Liberal Democrat and Labour Party Manifestos for the 2005 UK General Election.

Access the tool online at http://ucrel.lancs.ac.uk/wmatrix.html
Note: if this shorter link is down, you can connect to Wmatrix directly.

Usernames for Wmatrix are free to members of Lancaster University. If you would like access to Wmatrix, please contact Paul Rayson.
Usernames for academic research and teaching (non-Lancaster users): A free one-month trial is available for individual users. Please contact Paul Rayson to set up a username and password. Once the one-month trial has expired, usernames are available for around £100 (depending on the exchange rate) per username per year from the online secure order page hosted at regsoft. Multiple usernames (or years) may be purchased at a reduced cost. Please ask Paul for details. Further development and external availability of Wmatrix currently depend on licensing its use.


Introduction to Wmatrix

Wmatrix users can upload their own corpus data to the system, so that it can be automatically annotated and viewed via the web browser. Each file is stored in a workarea (equivalent to a folder in Windows or directory on Unix).
Input format guidelines
The format of the text analysed in Wmatrix can be 'raw' or 'HTML'. Pre-editing of the input text is not normally required, although it may improve the analysis. Guidelines are provided for texts to be tagged by CLAWS. The most important is to replace less-than (<) and greater-than (>) characters with the corresponding SGML entity references (&lt;) and (&gt;) respectively. If the text contains less-than or greater-than symbols, in formulae for example, then CLAWS may mistake large quantities of the following text for SGML tags, or fail to POS tag the file. If the text contains HTML, SGML or XML tags then it is best to select the 'HTML' input format. The guidelines mention start and end text markers, but these are not required since Wmatrix inserts them for you.
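
For 'raw' text, the replacement of angle brackets described above can be done before upload. The following is only a minimal sketch in Python. The function name escape_angle_brackets is made up for illustration and is not part of Wmatrix or CLAWS. It should not be applied to text that contains real HTML, SGML or XML tags, where the 'HTML' input format is the better choice.

```python
def escape_angle_brackets(text):
    """Replace < and > with the SGML entity references &lt; and &gt;.

    Hypothetical helper for pre-editing raw text. Other characters,
    including ampersands, are left untouched.
    """
    return text.replace("<", "&lt;").replace(">", "&gt;")


if __name__ == "__main__":
    sample = "The result holds if x < 3 and y > 2."
    print(escape_angle_brackets(sample))
```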

Tag wizard
Wmatrix users can upload their file and complete the automatic tagging process by clicking on the tag wizard. Once the file has been uploaded to the web server, it is POS tagged by CLAWS and semantically tagged by USAS. This process can be carried out step by step starting with the 'manual load file' option. As a shortcut you can simply upload frequency profiles if you have them. The format for a frequency list is a very simple two column format with a total line at the head of the file. You can see an example of this. The column widths are not significant.

View of workarea
By clicking on the workarea name, the user can see its contents. Following the application of the tag wizard, the workarea contains the original text, POS and semantically tagged versions of that text, and a set of frequency profiles.
Viewpoints present different views on the same data. Each viewpoint represents a different set of frequency lists and queries applied to the data. Some viewpoints have built-in functions. Users can save lists and queries for any viewpoint, and then create new viewpoints by cloning existing ones.

Frequency profiles
From the workarea view, the user can click on a frequency list to see the most frequent items in their corpus. Frequency lists are available for words, POS tags and semantic tags. The lists can be sorted alphabetically or by frequency.
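
To make the idea of a frequency list concrete, here is a small Python sketch that counts items and sorts them either alphabetically or by frequency. It is an illustration of the general technique only, not the code Wmatrix runs. The function name frequency_list and its "by" parameter are assumptions introduced for this example. The items can be words, POS tags or semantic tags.

```python
from collections import Counter


def frequency_list(items, by="frequency"):
    """Return (item, count) pairs for a sequence of words or tags.

    by="frequency" sorts by descending count (ties broken alphabetically);
    by="alphabetical" sorts by the item itself.
    """
    counts = Counter(items)
    if by == "alphabetical":
        return sorted(counts.items())
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    words = "the party will cut tax and the party will invest".split()
    for item, count in frequency_list(words):
        print(f"{count:5d}  {item}")
```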

From the frequency list view, the user can click on context and see standard concordances. These are key item in context (KIIC) concordances because they can show all occurrences for words in one POS or semantic category.
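
The difference between a key item in context concordance and an ordinary key word in context concordance is that the key can be a tag, so one query returns every word carrying that tag. A rough Python sketch of the idea follows. The function name kiic, the (word, tag) pair representation and the small tagged sample are all assumptions made for illustration, not Wmatrix's internal format.

```python
def kiic(tagged, tag, width=3):
    """Return (left context, key word, right context) for every token
    whose tag equals `tag`.

    `tagged` is a list of (word, tag) pairs; `width` is the number of
    words of context shown on each side.
    """
    lines = []
    for i, (word, token_tag) in enumerate(tagged):
        if token_tag == tag:
            left = " ".join(w for w, _ in tagged[max(0, i - width):i])
            right = " ".join(w for w, _ in tagged[i + 1:i + 1 + width])
            lines.append((left, word, right))
    return lines


if __name__ == "__main__":
    # Illustrative sample only; the tags are written in the style of CLAWS POS tags.
    sample = [("the", "AT"), ("party", "NN1"), ("will", "VM"),
              ("cut", "VVI"), ("tax", "NN1")]
    for left, key, right in kiic(sample, "NN1", width=2):
        print(f"{left:>12}  [{key}]  {right}")
```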

Key words, key word classes and key concepts: comparison of frequency lists
From the workarea view, the user can click on compare frequency list to compare the frequency list for their corpus against a larger normative corpus such as the BNC sampler, or against another of their own texts (once that text has been loaded into Wmatrix). This comparison can be carried out at the word level (to see keywords), at the POS level (to see key word classes), or at the semantic level (to see key concepts). Wmatrix uses the log-likelihood statistic for this comparison. For more details, see the log-likelihood calculator.

The semantic frequency list has an option 'compare to normative BNCIT'. This lets you compare the concept frequencies against a slightly different corpus (a subcorpus of the BNC) which has been semantically tagged. From this comparison you can 'list' the words in each concept or show 'context' to see the words in their original context in your text. Further information on the BNCIT corpus can be found on the REVERE project page.

In all cases the key comparison shows the most significant key items towards the top of the list, since the result is sorted on the LL (log-likelihood) field, which shows how significant the difference is. You should just look at items with a '+' code, since this shows overuse in your text as compared to the standard English corpora. For a statistically significant result, look at items with an LL value over about 7, since 6.63 is the cut-off for significance at the 99% level (p < 0.01).
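
As a worked illustration of the comparison, the Python sketch below computes a log-likelihood value for one item from its frequency in two corpora and the sizes of those corpora, using the two-term form of the statistic commonly used for corpus frequency comparison. That Wmatrix computes exactly this form is an assumption here; the log-likelihood calculator is the authoritative description. The function names log_likelihood and usage_code and the sample figures are invented for the example.

```python
import math


def log_likelihood(freq1, freq2, size1, size2):
    """Log-likelihood for one item across two corpora.

    freq1, freq2: frequency of the item in corpus 1 and corpus 2.
    size1, size2: total number of items in corpus 1 and corpus 2.
    """
    total = freq1 + freq2
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    # A zero frequency contributes nothing to the sum.
    if freq1 > 0:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2 > 0:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll


def usage_code(freq1, freq2, size1, size2):
    """'+' if the item is relatively more frequent in corpus 1, else '-'."""
    return "+" if freq1 / size1 > freq2 / size2 else "-"


if __name__ == "__main__":
    # Invented figures: 10 occurrences in 1,000 words against none in 1,000.
    ll = log_likelihood(10, 0, 1000, 1000)
    code = usage_code(10, 0, 1000, 1000)
    print(f"{code} LL = {ll:.2f}  significant at 99%: {ll > 6.63}")
```

An item with the same relative frequency in both corpora gets a value of 0, and the value grows as the relative frequencies diverge, which is why sorting on the LL field puts the most distinctive items first.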


A handout describing Wmatrix is available as Adobe PDF:
Rayson, P. (2001). Wmatrix: a web-based corpus processing environment. Software demonstration presented at ICAME 2001 conference, Université catholique de Louvain, Belgium. May 16-20, 2001. (PDF handout)
Please reference Wmatrix as follows:
Rayson, P. (2005) Wmatrix: a web-based corpus processing environment, Computing Department, Lancaster University. http://www.comp.lancs.ac.uk/ucrel/wmatrix/
Rayson, P. (2003). Matrix: A statistical method and software tool for linguistic analysis through corpus comparison. Ph.D. thesis, Lancaster University. (abstract or full text)


Systems engineering: see the publications listed under the REVERE project (1999-2002).
Frequency profile comparison of written and spoken English: See Leech, G., Rayson, P., and Wilson, A. (2001). Word Frequencies in Written and Spoken English: based on the British National Corpus. Longman, London. (see the companion website for more details)
Training chatbots: comparison of human-human and human-machine dialogues. See Abu Shawar, Bayan; Atwell, Eric. Using dialogue corpora to train a chatbot. In Archer, D, Rayson, P, Wilson, A & McEnery, T (editors) Proceedings of CL2003: International Conference on Corpus Linguistics, pp. 681-690 Lancaster University. 2003.
Computer content analysis: analysis of interview transcripts.
Computer content analysis of political discourse. See Xin Huang (2003) A Computer-aided Diachronic Content Analysis of Twentieth Century Political Discourse in China. MA dissertation in Language Studies, Lancaster University.
Keyword analysis: See Marilyn Deegan, Harold Short, Dawn Archer, Paul Baker, Tony McEnery, Paul Rayson (2004) Computational Linguistics Meets Metadata, or the Automatic Extraction of Key Words from Full Text Content. RLG Diginews, Vol. 8, No. 2. ISSN 1093-5371.
Key word-class analysis for EAP: See Jones, M., Rayson, P. and Leech, G. (2004) Key category analysis of a spoken corpus for EAP. Presented at The 2nd Inter-Varietal Applied Corpus Studies (IVACS) International Conference on "Analyzing Discourse in Context" The Graduate School of Education, Queens University, Belfast, Northern Ireland, 25 - 26 June, 2004.
Phraseology: Magali Paquot, Sylviane Granger, Paul Rayson and Cédrick Fairon (forthcoming) Extraction of multi-word units from EFL and native English corpora: The phraseology of the verb 'make'. To be presented at Europhras, European Society of Phraseology, 26-29 August 2004, Basel, Switzerland.
Key word analysis for digital libraries: Walkerdine, J. and Rayson, P. (2004) P2P-4-DL: Digital Library over Peer-to-Peer. In Caronni G., Weiler N., Shahmehri N. (eds.) Proceedings of Fourth IEEE International Conference on Peer-to-Peer Computing (P2P2004) 25-27 August 2004, Zurich, Switzerland. IEEE Computer Society Press, pp. 264-265. ISBN 0-7695-2156-8.
Comparison of political party manifestos: (Labour versus LibDem UK 2001 General Election) Paul Rayson (2004). Keywords are not enough. Invited talk for JAECS (Japan Association for English Corpus Studies) at Chuo University, Tokyo, Japan, 27th November 2004. (slides)
Re: Wmatrix corpus analysis and comparison tool

Dear XuSun575,
have you successfully applied for a Wmatrix username? I've sent two e-mails to the author, but he didn't reply.