The Lancaster Corpus of Mandarin Chinese (LCMC) is a balanced corpus of modern Chinese built by Dr. Zhonghua Xiao under the supervision of Professor Tony McEnery; its initial construction, which took more than half a year, was completed in June 2003. The corpus project was undertaken by the Department of Linguistics at Lancaster University and funded by the UK Economic and Social Research Council (ESRC). LCMC is a corpus of written Chinese compiled strictly on the model of the Freiburg-LOB Corpus of British English (FLOB), and its completion facilitates corpus-based monolingual Chinese research as well as Chinese-English (English-Chinese) contrastive research.
2.0 An overview of the LCMC corpus
LCMC is a one-million-word balanced corpus of modern written Chinese (counted on the convention that roughly 1.6 Chinese characters correspond to one English word). It was originally created as part of the ESRC-funded project Contrasting Tense and Aspect in English and Chinese, with the aim from the outset of building a modern Chinese counterpart to FLOB and FROWN. The main motivation for constructing such a corpus was that, although many Chinese corpora already existed (Yang 2003), none of them was a balanced Chinese corpus that was completely free and open to the public.
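As a rough illustration of the sizing convention mentioned above, the minimal Python sketch below converts a Chinese character count into an approximate English-equivalent word count at the cited ratio of 1.6 characters per word. The character total used here is a made-up placeholder, not the actual LCMC figure.

```python
# Rough corpus-size estimate using the convention cited above:
# about 1.6 Chinese characters correspond to one English word.
CHARS_PER_ENGLISH_WORD = 1.6

def estimated_word_count(char_count: int) -> int:
    """Convert a raw Chinese character count into an approximate word count."""
    return round(char_count / CHARS_PER_ENGLISH_WORD)

if __name__ == "__main__":
    chars = 1_600_000  # hypothetical character total, not the real LCMC count
    print(f"{chars} characters is roughly {estimated_word_count(chars)} words")
```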
Write to the copyright holders to ask for permission and let them know that speaker identities (for spoken data) are made anonymous and that the data is used for academic research, not for any commercial purpose (or, if you do make money, that you will enter into a profit-sharing agreement). Most copyright holders I have contacted have been very cooperative.
So here is the situation: if we want to use the indexed version of LCMC, we need to stick to version 1.13 of Xaira, but for other corpora would it be preferable to use Xaira 1.14, given that it is the latest version?
If you know how to index a corpus yourself, you can use the latest release of Xaira, but note that the Collocation link is broken in 1.14. We will release 1.15 in a couple of days.
WordSmith's WordList merges all numerals into #, even when those numbers are individual word tokens in the corpus. Also, the one million word tokens in the LCMC corpus include both words and symbols/punctuation marks, which are omitted from a WordSmith wordlist.
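To make the difference in counting conventions concrete, here is a minimal sketch. It assumes the text is already tokenised into a plain list of tokens (the sample below is hypothetical, and this is not WordSmith's own procedure); unlike a WordSmith wordlist, it keeps each numeral as its own type and includes punctuation tokens in the overall total.

```python
import re
from collections import Counter

def count_tokens(tokens):
    """Count tokens while keeping each numeral as its own type and including
    punctuation tokens in the running total (both unlike a WordSmith wordlist)."""
    freq = Counter(tokens)
    total = len(tokens)  # words + numerals + punctuation
    numerals = sum(v for t, v in freq.items() if t.isdigit())
    punct = sum(v for t, v in freq.items() if re.fullmatch(r"\W+", t))
    return freq, total, numerals, punct

if __name__ == "__main__":
    # Hypothetical tokenised sample for illustration only, not actual LCMC data.
    sample = "这 是 1998 年 的 数据 。".split()
    freq, total, numerals, punct = count_tokens(sample)
    print(total, numerals, punct)  # 7 tokens in all, 1 numeral, 1 punctuation mark
```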