All you want to know about LCMC

xiaoz

Permanent super administrator
Staff member
Q: What is LCMC anyway?
A: LCMC is an acronym for the Lancaster Corpus of Mandarin Chinese, a one-million-word balanced corpus of written Mandarin Chinese.

Q: What can the corpus be used for?
A: The corpus was designed as a Chinese match for the Freiburg-LOB Corpus of British English (FLOB) and the Freiburg-Brown Corpus of American English (Frown). The three corpora are comparable in both sampling period and sampling frame, each consisting of five hundred 2,000-word samples taken from 15 text categories published around 1991. LCMC can therefore be used for cross-sectional research on modern written Chinese and, in combination with FLOB/Frown, for cross-linguistic contrast between Chinese and British/American English. Some creative uses, however, can never be envisaged by corpus builders. For example, because LCMC has a Pinyin version in addition to the standard Chinese-character version, some people have used it to train character-Pinyin conversion software.

Q: What markup and annotation have been undertaken on the corpus?
A: The LCMC corpus is marked up in XML at five levels (text category, sample file, paragraph, sentence and token), in addition to an informative corpus header. The data is tokenised and POS tagged, with an accuracy rate of ca. 98%.
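
As a rough illustration of how the five markup levels could be traversed, the sketch below parses a tiny hand-made XML fragment with Python's standard library. The element and attribute names (`text`, `file`, `p`, `s`, `w`, `ID`, `TYPE`, `n`, `POS`) and the tag values are assumptions made for this example, not details confirmed in this FAQ, so check them against the corpus header and the actual files before relying on them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hand-made fragment for illustration only.
# Element names, attribute names and tag values are assumed, not confirmed.
SAMPLE = """<text ID="A" TYPE="Press reportage">
  <file ID="A01">
    <p>
      <s n="0001">
        <w POS="n">语料库</w>
        <w POS="v">是</w>
        <w POS="n">资源</w>
      </s>
    </p>
  </file>
</text>"""


def count_pos(xml_string):
    """Walk category > file > paragraph > sentence > token and count POS tags."""
    root = ET.fromstring(xml_string)  # root is the text-category element
    tags = Counter()
    for file_el in root.iter("file"):
        for para in file_el.iter("p"):
            for sent in para.iter("s"):
                for tok in sent.iter("w"):
                    tags[tok.get("POS")] += 1
    return sum(tags.values()), tags


total, pos_counts = count_pos(SAMPLE)
print(total, dict(pos_counts))
```

For a real sample file you would presumably swap `ET.fromstring(SAMPLE)` for `ET.parse(path).getroot()`, adjusting the names to whatever the files actually use.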

Q: Is the corpus freely available?
A: The corpus is freely available for use in academic research and education, but a fee is charged for any commercial use.

Q: Where can I order a copy of the corpus?
A: The LCMC corpus is officially released by the European Language Resources Association (ELRA) and the Oxford Text Archive (OTA). Non-commercial users can order a copy of the corpus from ELRA or OTA, or download it from the LCMC website. They can also access the corpus online. Commercial users must contact ELRA. Here are some useful links:

ELRA Cat. W0039: http://www.elda.org/catalogue/en/text/W0039.html
OTA Cat. No 2474: http://ota.ox.ac.uk/textinfo/2474.html
LCMC site: http://www.ling.lancs.ac.uk/corplang/lcmc/default.htm
LCMC download link: http://www.ling.lancs.ac.uk/corplang/lcmc/lcmc/license.html
LCMC WebConc: http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
LCMC Beijing (in Chinese): http://www.cass.net.cn/chinese/s18_yys/dangdai/LCMC/LCMC.htm

Q: Can I use LCMC with WordSmith 4?
A: Yes. The LCMC corpus is encoded in UTF-8. When you load the corpus, you will be prompted to convert it into Unicode (UTF-16). After conversion, WordSmith works reliably on the corpus as long as you have adjusted the settings properly (e.g. selecting the language and specifying the markup format).
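
WordSmith performs this conversion itself when prompted. If you would rather prepare UTF-16 copies beforehand, a minimal Python sketch along the following lines might do it. The function name and the file names are hypothetical, and whether WordSmith needs a particular byte order is not stated here, so treat this as an assumption-laden example rather than a recommended workflow.

```python
import tempfile
from pathlib import Path


def utf8_to_utf16(src, dst):
    """Re-encode one UTF-8 text file as UTF-16.

    Python's "utf-16" codec writes a byte order mark (BOM) at the start
    and uses the machine's native byte order.
    """
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(text, encoding="utf-16")
    return len(text)


# Demo on a throwaway file (hypothetical names, not real corpus files).
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample_utf8.xml"
    dst = Path(tmp) / "sample_utf16.xml"
    src.write_text("<w>语料库</w>", encoding="utf-8")
    utf8_to_utf16(src, dst)
    round_trip = dst.read_text(encoding="utf-16")
    has_bom = dst.read_bytes()[:2] in (b"\xff\xfe", b"\xfe\xff")
```

Note that if a file begins with an XML declaration naming UTF-8 as its encoding, that declaration would no longer match the re-encoded file and may need updating by hand.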

Q: Can I use LCMC with Xaira?
A: Yes. Xaira's development relied on LCMC for various tests of its Chinese support. You can also download an indexed version of the corpus if you do not wish to index the corpus yourself, but note that the indexed version only works with Xaira 1.10-1.13.
 
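Chinese version (English translation)
Q: What is LCMC?
A: LCMC is a one-million-word balanced corpus of modern written Chinese. Its English name is the Lancaster Corpus of Mandarin Chinese, and LCMC is its acronym. A detailed introduction to LCMC in Chinese is available here:
http://www.corpus4u.org/upload/forum/2005061501071124.doc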

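Q: What can the corpus be used for?
A: LCMC is a corpus of modern Chinese designed and built to match FLOB and Frown. The three corpora are comparable in sampling period and sampling frame: each consists of 500 texts of about 2,000 words, selected from 15 text categories published around 1991. LCMC can therefore be used to compare modern written Chinese with contemporary British and American English. The creative ways in which LCMC may be developed and used, however, go well beyond what its builders could foresee. For example, LCMC has a Pinyin version in addition to the standard Chinese-character version, and some people have used the Pinyin version to train character-Pinyin conversion software.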

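Q: What markup and annotation have been applied to the LCMC corpus?
A: The LCMC corpus is marked up in XML at five levels: text category, sample file within each category, paragraph, sentence and word, together with a header file containing further information. The data has been word-segmented and POS tagged (with an accuracy of 98%).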

Q: Is the LCMC corpus freely available?
A: The LCMC corpus is free for academic and educational use, but a fee is charged for commercial use.

Q: How can I obtain the LCMC corpus?
A: The LCMC corpus is officially released by the European Language Resources Association (ELRA) and the Oxford Text Archive (OTA). Non-commercial users can order it from ELRA or OTA, or download it from the LCMC website. Commercial users need to contact ELRA. Here are some useful links:
ELRA Cat. W0039: http://www.elda.org/catalogue/en/text/W0039.html
OTA Cat. No 2474: http://ota.ox.ac.uk/textinfo/2474.html
LCMC main site: http://www.ling.lancs.ac.uk/corplang/lcmc/default.htm
LCMC download link: http://www.ling.lancs.ac.uk/corplang/lcmc/lcmc/license.html
LCMC online concordancer: http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
LCMC Chinese mirror site: http://www.cass.net.cn/chinese/s18_yys/dangdai/LCMC/LCMC.htm

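Q: Can LCMC be processed with WordSmith 4?
A: Certainly. LCMC is encoded in UTF-8. When you load the LCMC texts into WS4, the software will prompt you to convert the encoding to Unicode (UTF-16). After conversion, WS4 handles LCMC very reliably as long as the other settings are correct (e.g. selecting the language and specifying the markup format).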

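Q: Can I process LCMC with Xaira?
A: Yes. LCMC was used extensively during Xaira's development to test how well it handles Chinese. In addition, if you do not want to index the corpus yourself, you can download an indexed copy. Please note, however, that the indexed version only works with Xaira 1.10-1.13.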
 