[New] The UCLA Chinese corpus

xiaoz · 2007-02-01

Hongyin and I are pleased to announce a brand new corpus of written Chinese:

The UCLA Chinese corpus

The UCLA Chinese Corpus is designed as a Chinese counterpart to the FLOB and Frown corpora of British and American English for contrastive research, and as a recent update of the Lancaster Corpus of Mandarin Chinese (LCMC) for diachronic studies of possible changes in written Chinese over the past decade. This period is of special significance because of the impact of the Internet on language, especially on Chinese, which makes the corpus an excellent complement to LCMC.

The samples in the corpus were all collected from written modern Chinese available on the Internet during the period 2000-2005, though some texts may have been converted from paper-based publications of earlier years. File types are matched as closely as possible to the Brown corpus model, with some variations (e.g. adventure fiction) to accommodate Chinese characteristics, while the proportions of the different text categories may vary from those of the English counterparts and LCMC. The genres currently covered and their sample sizes are shown in the table below. Our target size is one million tokens.

The corpus is Unicode and XML-compliant. Each corpus file is composed of a corpus header and a text body. The header gives general information about the corpus file. In the body, paragraphs, sentences and tokens are marked up, with each sentence numbered and each token annotated for part of speech.
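
For readers who want to process files laid out this way, here is a minimal Python sketch of reading numbered sentences and POS-tagged tokens from such a file. The element and attribute names used here (file, header, body, p, s, w, n, pos) and the sample text and tags are assumptions made up for illustration, not the actual UCLA markup, so check a real corpus file and the POS tagset before relying on them.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample file: the element names, attribute names and
# tags are assumptions for illustration, not the real corpus scheme.
SAMPLE = """<file id="A01">
  <header><title>sample</title></header>
  <body>
    <p>
      <s n="1"><w pos="r">我</w><w pos="v">喜欢</w><w pos="n">语料库</w></s>
      <s n="2"><w pos="r">他</w><w pos="v">来</w></s>
    </p>
  </body>
</file>"""


def read_sentences(xml_text):
    """Return a list of (sentence_number, [(token, pos), ...]) pairs."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        tokens = [(w.text, w.get("pos")) for w in s.iter("w")]
        sentences.append((s.get("n"), tokens))
    return sentences


def pos_frequencies(sentences):
    """Count how often each POS tag occurs across all sentences."""
    return Counter(pos for _, tokens in sentences for _, pos in tokens)


if __name__ == "__main__":
    sents = read_sentences(SAMPLE)
    for number, tokens in sents:
        print(number, " ".join(f"{tok}/{pos}" for tok, pos in tokens))
    print(pos_frequencies(sents))
```

The same two functions could be pointed at a real file by reading its contents as text first, once the names are adjusted to the actual markup.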

The UCLA Chinese Corpus is a product of the joint effort of Professor Hongyin Tao (University of California Los Angeles) and Dr. Richard Xiao (UCREL of Lancaster University). Funding for this project was provided to Hongyin Tao by the UCLA Academic Senate during the academic years 2003-2005, while Richard Xiao was supported by the UK Economic and Social Research Council (Award Reference RES-000-23-0553). We are also obliged to Iris Li, Haiyong Liu, and Hui Zhang for their assistance in data collection.

The corpus is distributed free of charge for use in non-profit-making research. For licensing information, please refer to the LCMC licence. You are welcome to access the corpus using our web-based concordancer. Click here to have a look at the POS tagset.

The UCLA Chinese Corpus can be cited as: Tao, Hongyin and Richard Xiao (2007) The UCLA Chinese Corpus. UCREL, Lancaster.

Disclaimer: We give no warranties that the UCLA corpus will be suitable for any particular purpose and accept no responsibility for any technical limitations of the corpus or software.

Haiyang Ai · 2007-02-01

Re: [New] The UCLA Chinese corpus

Congrats! Yet another publicly available Chinese corpus.

laohong · 2007-02-01

Re: [New] The UCLA Chinese corpus

Warm congratulations!

清风出袖 · 2007-02-01

Re: [New] The UCLA Chinese corpus

Thanks a lot, Dr. Xiao! More importantly, "the corpus is distributed free of charge for use in non-profit-making research."

xiaoz · 2007-02-01

Re: [New] The UCLA Chinese corpus

Thanks.
laohong has loads of treasures in his toolkit.

xiaoz · 2007-02-01

Re: [New] The UCLA Chinese corpus

In case you wondered, a corpus can cost a fortune:

Press Release - Immediate Paris, France, January, 18th 2007
Distribution Agreement

ELRA today signed a major Language Resources distribution agreement with Beijing Haitian Ruisheng Science Technology Ltd.

ELRA and Beijing Haitian Ruisheng Science Technology Ltd today signed a major Language Resources distribution agreement. On behalf of ELRA, ELDA will act as the distribution agency for Beijing Haitian Ruisheng Science Technology Ltd and will incorporate to the ELRA Language Resources catalogue a large number of Speech resources designed and collected to boost Speech Synthesis and Speech Recognition. The resources cover mainly Mandarin Chinese with some coverage of Korean and Japanese languages.

With over 60 new resources, ELDA is strengthening its position as the leading worldwide distribution centre. With this agreement Beijing Haitian Ruisheng Science Technology Ltd will get more visibility in particular on the European market.

List of available Speech Resources:
http://catalog.elra.info/search_result.php?keywords=s0228

List of available Written Corpora:
http://catalog.elra.info/search_result.php?keywords=w0045

armstrong · 2007-02-01

Re: [New] The UCLA Chinese corpus

Great, Dr. Tao and Dr. Xiao!

JanChang · 2013-04-16

Re: [New] The UCLA Chinese corpus

Quote from Haiyang:
Congrats! Yet another publicly available Chinese corpus.

May I ask why, on the page http://www.lancs.ac.uk/fass/projects/corpus/UCLA/ , under
"You are welcome to access the corpus using our web-based concordancer hosted at The Institute of Education, Singapore"
the web-based concordancer link that leads to the actual corpus (http://corpus.nie.edu.sg/cgi-bin/ucla/UCLAconc.pl) cannot be opened?

JanChang · 2013-04-17

Re: [New] The UCLA Chinese corpus

Quote from JanChang:
May I ask why, on the page http://www.lancs.ac.uk/fass/projects/corpus/UCLA/ , under
"You are welcome to access the corpus using our web-based concordancer hosted at The Institute of Education, Singapore"
the web-based concordancer link that leads to the actual corpus (http://corpus.nie.edu.sg/cgi-bin/ucla/UCLAconc.pl) cannot be opened?

Solved.
No need to reply, thanks.

xujiajin · 2013-05-04

Re: [New] The UCLA Chinese corpus

The second edition of the UCLA Written Chinese Corpus (UCLA2) is now available at BFSU CQPweb.
http://111.200.194.212/cqp/
ID: test
Password: test

The username and password are both test.

JanChang · 2013-05-09

Re: [New] The UCLA Chinese corpus

Quote from xujiajin:
The second edition of the UCLA Written Chinese Corpus (UCLA2) is now available at BFSU CQPweb.
http://111.200.194.212/cqp/
ID: test
Password: test

The username and password are both test.

Thank you very much.

zhuman · 2015-03-10

Re: [New] The UCLA Chinese corpus

http://111.200.194.212/cqp/
May I ask why I keep getting an error and cannot log in, even though I entered test as both the username and the password?

xujiajin · 2015-03-12

Re: [New] The UCLA Chinese corpus

Verified: logging in with test to The UCLA Corpus of Written Chinese (2nd edition) works without any problems.
