Saving the listing result of a Xaira query fails: how can I build word lists with frequencies for different POS in Xaira?

I tried to extract words by part of speech using Xaira. For example, I want to make a wordlist of all the nouns in the BNC and another of all the adjectives. INTERJ has only 378,111 hits in the BNC, the fewest of any category in the search results, so to save time I experimented with it first. I used an Addkey query, chose "pos", "any" and "INTERJ", and let it run. Then I downloaded all the INTERJ hits and chose "Listing" in the "Query" menu. It took almost five hours to finish the listing. After this experiment I tried the same procedure on "ADJ", which has 11,824,833 hits. It again ran for more than five hours, but the listing result never came out because the program reported "Memory runs out". My computer has 2 GB of memory and a Duo CPU. Why is there such a problem? I tried it on nouns as well, but that did not work either. Is something wrong with Xaira? How can I make wordlists by POS with frequency information?
 
Re: Saving the listing result of a Xaira query fails: how can I build word lists with frequencies for different POS in Xaira?

It's a memory problem: just imagine how much memory would be taken up by over 100,000 concordance lines. When there is not enough physical memory, your disk space is used as virtual memory, which is much slower.

For a frequency list of POS categories, why do it yourself? There are book-length discussions of such statistics available:

http://www.comp.lancs.ac.uk/ucrel/bncfreq/flists.html
 
Re: Saving the listing result of a Xaira query fails: how can I build word lists with frequencies for different POS in Xaira?

....It took me almost five hours to finish the listing. After this experiment, I tried to do it the same way on "adj", which has 11,824,833 hits. Then it took more than five hours again....

I really admire your patience!
 
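Thank you both for the guidance above. The book Word Frequencies in Written and Spoken English is indeed very useful and has benefited my research a great deal. However, what I need now is electronic wordlists and frequency lists for the different parts of speech, for further computer processing and vocabulary research. I cannot type in four or five hundred thousand words by hand, so I hope to extract wordlists by POS with Xaira or some other software.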

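Xaira's query design does include a listing function, but Xaira tends to be slow at this. By comparison, WordSmith searches faster, but it has no function for building a wordlist directly by POS. All I can do is run a concordance for pos="ADJ", save it as a txt file, and then write a program to extract the wordlist and frequencies. One more question: the data in Word Frequencies in Written and Spoken English (2001) were computed from the BNC World edition. Can they be treated as equivalent to statistics from the BNC XML Edition (2007) that I have?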
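The "save the concordance as txt, then extract the wordlist" step mentioned above could be sketched roughly as follows. This is only an illustration: the tab-separated layout (left context, node word, right context) is an assumption, not the documented export format of WordSmith or Xaira, and the function name `wordlist_from_concordance` is hypothetical. The column index would need adjusting to the real file.

```python
from collections import Counter
import io


def wordlist_from_concordance(lines, node_column=1):
    """Count node words in tab-separated concordance lines.

    Assumed (hypothetical) layout per line:
    left context <TAB> node word <TAB> right context
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= node_column:
            continue  # skip blank or malformed lines
        node = fields[node_column].strip().lower()  # fold case
        if node:
            counts[node] += 1
    return counts


# Tiny made-up sample standing in for a saved concordance file.
sample = io.StringIO(
    "it was a\tgood\tday for walking\n"
    "she had a\tGood\tidea\n"
    "in the\told\thouse\n"
)
counts = wordlist_from_concordance(sample)
for word, freq in counts.most_common():
    print(f"{word}\t{freq}")
```

For a real file, the `io.StringIO` sample would be replaced by `open(path, encoding="utf-8")`; because the lines are read one at a time, only the counts are held in memory, not the concordance itself.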

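To get these data I tried various things with Xaira and WordSmith 5.0. I found that Xaira's query function can produce a listing, but it can only write out the results of a search such as pos="ADJ" in XML format. The problem is that it records the results successfully for a small category such as interjections, but fails for large categories such as ADJ and SUBST, reporting "out of memory", so I think there is still something wrong with the software. As for the memory issue mentioned in the reply above, the BNC XML installation guide says "The Xaira program requires at least 512 MB of memory to run", and my machine has 2 GB and a dual-core CPU. So I went ahead and used Xaira to produce listings for the different POS, and the listing succeeded for some categories and failed for others. My impression is that Xaira is slower than WordSmith and uses more resources. Could anyone advise me on how to complete this task? (For the experiments above I usually started Xaira before going to bed, hoping to wake up and find my wish come true.)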
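One possible way around the out-of-memory problem, which nobody in this thread has proposed and which bypasses Xaira entirely, is to stream the BNC XML files directly and keep only the frequency counts in memory. The sketch below assumes that each word in the BNC XML Edition is marked up as a `<w>` element carrying a `pos` attribute with values such as "ADJ", "SUBST" or "INTERJ"; please check this against your own copy. The function names `count_by_pos` and `count_corpus` and the directory argument are hypothetical.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def count_by_pos(xml_source, wanted_pos):
    """Stream one XML text and count lower-cased word forms whose
    `pos` attribute equals wanted_pos (assumed markup: <w pos="...">)."""
    counts = Counter()
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # tolerate a namespace prefix
        if tag == "w" and elem.get("pos") == wanted_pos:
            form = (elem.text or "").strip().lower()
            if form:
                counts[form] += 1
        elem.clear()  # drop the element's content once it has been handled
    return counts


def count_corpus(root_dir, wanted_pos):
    """Sum the counts over every .xml file below root_dir (hypothetical path)."""
    total = Counter()
    for path in sorted(Path(root_dir).rglob("*.xml")):
        total.update(count_by_pos(str(path), wanted_pos))
    return total


# Made-up fragment in the assumed markup, used instead of a real corpus file.
sample = io.BytesIO(
    b'<s n="1"><w c5="AT0" hw="the" pos="ART">The </w>'
    b'<w c5="AJ0" hw="old" pos="ADJ">old </w>'
    b'<w c5="NN1" hw="house" pos="SUBST">house </w>'
    b'<w c5="VBD" hw="be" pos="VERB">was </w>'
    b'<w c5="AJ0" hw="old" pos="ADJ">Old</w><c c5="PUN">.</c></s>'
)
adj = count_by_pos(sample, "ADJ")
for word, freq in adj.most_common():
    print(f"{word}\t{freq}")
```

Because each file is parsed incrementally and only a `Counter` of word forms is kept, memory use depends mainly on the number of distinct forms and not on the number of hits, which is where a listing of millions of concordance lines runs into trouble.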

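For corpus research, suitable software lets us run experiments ourselves and obtain exactly the data we need, which is wonderful. It is worth mentioning that Mike Scott has made a BNC wordlist (http://www.lexically.net/wordsmith/) containing more than 500,000 words. However, I recently used his WordSmith 5.0 to build a new wordlist from the BNC XML Edition, and it contains more than 600,000 words, while a wordlist built from the BNC XML Edition with WordSmith 3.0 contains more than 300,000 words. I don't know why the difference is so large. This is well worth discussing: do different versions of WordSmith really give such different results on the same BNC? And does the same version of WordSmith give different wordlists for different editions of the BNC? If you have the means, you might like to try it yourselves. It took me only about 20 minutes to build the wordlist for the BNC XML Edition with WordSmith 5.0. WordSmith 5.0 can be tried for free, but the trial version cannot save results.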
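The size of a wordlist depends on how a tool splits text into words. The toy sketch below only illustrates that general point and does not describe what any version of WordSmith actually does. The two settings shown (case sensitivity and whether hyphens split words) are assumptions chosen for the example, and `count_types` is a hypothetical helper.

```python
import re


def count_types(text, case_sensitive, split_hyphens):
    """Return the number of distinct word forms under the given settings."""
    # With split_hyphens the hyphen is a separator, otherwise part of a word.
    pattern = r"[A-Za-z0-9']+" if split_hyphens else r"[A-Za-z0-9'-]+"
    tokens = re.findall(pattern, text)
    if not case_sensitive:
        tokens = [t.lower() for t in tokens]
    return len(set(tokens))


text = "The well-known author met the Well-known critic. The author left."

for case_sensitive in (False, True):
    for split_hyphens in (False, True):
        n = count_types(text, case_sensitive, split_hyphens)
        print(f"case_sensitive={case_sensitive} split_hyphens={split_hyphens}: {n} types")
```

On this one sentence the four combinations give 6, 7, 8 and 9 types, so when comparing wordlists from different tools or versions it may be worth checking whether their tokenisation settings match before comparing the totals.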
 
Re: Saving the listing result of a Xaira query fails: how can I build word lists with frequencies for different POS in Xaira?

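WordSmith and Xaira differ greatly in design philosophy and purpose. A function that one of them offers, or handles well, may not be available in the other. So the choice between them should depend on your own research purpose.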
 
Re: Saving the listing result of a Xaira query fails: how can I build word lists with frequencies for different POS in Xaira?

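A note on the suggestion quoted above: "For a frequency list of POS categories, why do it yourself? There are book-length discussions of such statistics available:

http://www.comp.lancs.ac.uk/ucrel/bncfreq/flists.html". The frequency lists in that book are very well made, but they are based on the BNC World edition, whereas the version now in use is the BNC XML edition. The word counts of the two editions differ somewhat, because the XML edition removed files that were duplicated in the older editions. Meanwhile, the tag settings in WordSmith Tools 5.0 still seem to be designed for BNC World and older BNC editions. There is no new tag file for BNC XML yet, which makes further searching of BNC XML very difficult.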
 