Q: What is LCMC anyway?
A: LCMC is an acronym for the Lancaster Corpus of Mandarin Chinese, a one-million-word balanced corpus of written Mandarin Chinese.
Q: What can the corpus be used for?
A: The corpus was designed as a Chinese match for the Freiburg-LOB Corpus of British English (FLOB) and the Freiburg-Brown Corpus of American English (Frown). The three corpora are comparable in both sampling period and sampling frame, each consisting of five hundred 2,000-word samples taken from 15 text categories published around 1991. LCMC can therefore be used for cross-sectional research on modern written Chinese and, in combination with FLOB/Frown, for cross-linguistic contrast between Chinese and British/American English. Some creative uses, however, can never be envisaged by corpus builders. For example, as LCMC has a Pinyin version in addition to the standard version in Chinese characters, some people have used it to train character-to-Pinyin conversion software.
Q: What markup and annotation have been undertaken on the corpus?
A: The LCMC corpus is marked up in XML at five levels: text category, sample file, paragraph, sentence and token, in addition to an informative corpus header. The data is tokenised and POS tagged, with an accuracy rate of ca. 98%.
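As a rough illustration of how the five markup levels might be read programmatically, here is a minimal Python sketch. The element and attribute names used below (text, file, p, s, w, c, POS) and the inline sample are assumptions made for illustration only; check them against the corpus header and the actual files before relying on them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical fragment mimicking the five levels described above:
# text category > sample file > paragraph > sentence > token.
# Element and attribute names are assumptions, not confirmed by this FAQ.
SAMPLE = """<text TYPE="A">
  <file ID="A01">
    <p>
      <s n="0001">
        <w POS="n">语料库</w>
        <w POS="v">是</w>
        <w POS="n">资源</w>
        <c POS="ew">。</c>
      </s>
    </p>
  </file>
</text>"""


def sentences(xml_text):
    """Return each sentence as a list of (token, POS tag) pairs."""
    root = ET.fromstring(xml_text)
    return [
        [(tok.text, tok.get("POS")) for tok in s if tok.tag in ("w", "c")]
        for s in root.iter("s")
    ]


def pos_counts(xml_text):
    """Count POS tags over word tokens only (punctuation is skipped)."""
    root = ET.fromstring(xml_text)
    return Counter(tok.get("POS") for tok in root.iter("w"))


if __name__ == "__main__":
    for sent in sentences(SAMPLE):
        print(" ".join(f"{word}/{tag}" for word, tag in sent))
    print(pos_counts(SAMPLE))
```

For a real sample file, ET.parse(path).getroot() could replace ET.fromstring, assuming the file is well-formed XML.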
Q: Is the corpus freely available?
A: The corpus is freely available for use in academic research and education, but a fee is charged for any commercial use.
Q: Where can I order a copy of the corpus?
A: The LCMC corpus is officially released by the European Language Resources Association (ELRA) and the Oxford Text Archive (OTA). Non-commercial users can order a copy of the corpus from ELRA or OTA, or download it from the LCMC website. They can also access the corpus online. Commercial users must contact ELRA. Here are some useful links:
ELRA Cat. W0039: http://www.elda.org/catalogue/en/text/W0039.html
OTA Cat. No 2474: http://ota.ox.ac.uk/textinfo/2474.html
LCMC site: http://www.ling.lancs.ac.uk/corplang/lcmc/default.htm
LCMC download link: http://www.ling.lancs.ac.uk/corplang/lcmc/lcmc/license.html
LCMC WebConc: http://www.ling.lancs.ac.uk/corplang/cgi-bin/conc.pl
LCMC Beijing (in Chinese): http://www.cass.net.cn/chinese/s18_yys/dangdai/LCMC/LCMC.htm
Q: Can I use LCMC with WordSmith 4?
A: Yes. The LCMC corpus is encoded in UTF-8. When you load the corpus, you will be prompted to convert it into UTF-16, which WordSmith refers to as Unicode. After conversion, WordSmith works reliably on the corpus as long as you have adjusted the settings properly (e.g. selecting the language and specifying the markup format).
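WordSmith performs this conversion itself when prompted. If you would rather prepare UTF-16 copies beforehand, a minimal Python sketch of the same re-encoding step follows. The function names and file paths are hypothetical, and the sketch does not touch any encoding declaration inside the XML files, which you may need to adjust separately.

```python
from pathlib import Path


def utf8_bytes_to_utf16(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16.

    Python's "utf-16" codec prepends a byte order mark (BOM).
    """
    return data.decode("utf-8").encode("utf-16")


def convert_file(src: str, dst: str) -> None:
    """Write a UTF-16 copy of a UTF-8 file, leaving the original untouched."""
    Path(dst).write_bytes(utf8_bytes_to_utf16(Path(src).read_bytes()))


if __name__ == "__main__":
    # Hypothetical file names, shown only to illustrate the call.
    # convert_file("LCMC_A.XML", "LCMC_A_utf16.XML")
    converted = utf8_bytes_to_utf16("语料库".encode("utf-8"))
    print(converted.decode("utf-16"))
```

Working on bytes rather than text mode keeps line endings exactly as they are in the source files.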
Q: Can I use LCMC with Xaira?
A: Yes. The development of Xaira has relied on LCMC for various tests of its Chinese support. You can also download an indexed version of the corpus if you do not wish to index the corpus yourself, but note that the indexed version only works with Xaira 1.10-1.13.