Using Xaira to extract keywords

xiaoz

Permanent super administrator
Staff member
Thanks to my colleague Andrew Hardie, a program is now available to extract WordSmith-style keywords from Xaira wordlists. It should work on all languages and has been tested on a range of them, including English, Nepali and Chinese.

To use the program, you will need to create one wordlist for your target corpus and one for your reference corpus in Xaira Client. The whole process is as follows:

1) Create a folder Xairakeys (or whatever you like) on your machine;
2) Download the two Xairakeys program files (link below) and unzip them into that folder;
3) Open your target corpus in Xaira Client;
4) Click on the Word Query button;
5) Check the box for Control and select All word forms (frequency >0);
6) Click on Lookup and a word list will appear;
7) Click on Save and select XML;
8) Specify the path: e.g. c:\xairakeys\target.xml;
9) Click on OK and the target wordlist is saved;
10) Repeat steps 3-9 for the reference corpus (this time save the file as c:\xairakeys\reference.xml);
11) Start a DOS session on your Windows system and change directory to c:\xairakeys;
12) Type the following command at the DOS prompt:
xairakeys target.xml reference.xml keywords.xml

The file keywords.xml contains the keywords in the target corpus, sorted by "keyness" (based on log-likelihood scores).
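For anyone curious about what a log-likelihood keyness score looks like in practice, here is a minimal sketch in Python. It uses the two-term log-likelihood formula that is common in corpus linguistics; whether xairakeys computes exactly this variant is an assumption on my part, and the function names and the plain-dictionary wordlists are hypothetical (this does not read the Xaira XML files).

```python
import math


def log_likelihood(freq_target, freq_ref, size_target, size_ref):
    """Two-term log-likelihood score for one word (assumed variant)."""
    total = freq_target + freq_ref
    # Expected frequencies if the word were equally likely in both corpora.
    expected_target = size_target * total / (size_target + size_ref)
    expected_ref = size_ref * total / (size_target + size_ref)
    score = 0.0
    # A zero frequency contributes nothing (limit of x * ln(x) as x -> 0).
    if freq_target > 0:
        score += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


def keywords(target_counts, ref_counts):
    """Return (word, score) pairs sorted by keyness, highest first.

    Only positive keywords are kept, i.e. words that are relatively
    more frequent in the target corpus than in the reference corpus.
    """
    size_target = sum(target_counts.values())
    size_ref = sum(ref_counts.values())
    scored = []
    for word, freq_target in target_counts.items():
        freq_ref = ref_counts.get(word, 0)
        if freq_target / size_target > freq_ref / size_ref:
            score = log_likelihood(freq_target, freq_ref, size_target, size_ref)
            scored.append((word, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


if __name__ == "__main__":
    # Toy wordlists, invented purely for illustration.
    target = {"a": 50, "b": 50}
    reference = {"a": 10, "b": 90}
    for word, score in keywords(target, reference):
        print(word, round(score, 2))
```

A word that has the same relative frequency in both corpora scores 0; the more its frequency departs from what the combined corpora would predict, the higher the score.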

I compared CallHome Mandarin (spoken, ca. 300,000 tokens, 7,479 types) with LCMC (written, 1 million tokens, 45,369 types), and it took just one minute (it may take two or three minutes on your machine). The words at the top of the list are all typical of spoken language (see attachment).

Xairakeys program files: http://www.corpus4u.org/upload/forum/2006010701554946.zip

CallHome Mandarin corpus keywords:
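http://www.corpus4u.org/upload/forum/2006010701564791.zip

These are the top-ranked keywords in CallHome:

那个 (that)
现在 (now)
反正 (anyway)
什么 (what)
这个 (this)
知道 (to know)
这边 (here, this side)
然后 (then)
使 (to make, to cause)
发展 (to develop)
你们 (you, plural)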
 