
Newly released in 2015: a corpus of Chinese learners' English, The TECCL corpus

Posted by xujiajin on 2015-12-28 in the "Learner Corpora and Second Language Acquisition" forum

  1. xujiajin

    xujiajin Administrator Staff Member

    Ten-thousand English Compositions of Chinese Learners (the TECCL Corpus) Version 1.1


    Download the TECCL corpus here!

    Key information of the TECCL Corpus

    Corpus name: Ten-thousand English Compositions of Chinese Learners (the TECCL corpus) (Version 1.1)

    Text contributors: Xue, Xizhe (Romanised pinyin notation of the Chinese word "learner")

    Project initiator: Jiajin Xu (the National Research Centre for Foreign Language Education, Beijing Foreign Studies University)

    Year of corpus creation: 2015

    Formats of the corpus: The TECCL corpus is available in two forms, raw texts and part-of-speech (POS) tagged texts, stored in the folders 01TECCL_RAW and 02TECCL_POS respectively. The POS texts were annotated with the CLAWS POS tagger developed at UCREL, Lancaster University, UK, using tag set version 7 (C7; cf. http://ucrel.lancs.ac.uk/claws7tags.html).

    Citation: Xue, Xizhe. 2015. Ten-thousand English Compositions of Chinese Learners (The TECCL corpus), Version 1.1. The National Research Centre for Foreign Language Education, Beijing Foreign Studies University.
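
For readers who want to work with the part-of-speech tagged texts mentioned under "Formats of the corpus", here is a minimal sketch of splitting tagged text into (word, tag) pairs. It assumes the files use the common word_TAG layout with tokens separated by whitespace; the post does not state the actual layout, so check a file in 02TECCL_POS first. The function name and the sample sentence are made up for illustration and are not taken from the corpus.

```python
from typing import List, Tuple


def parse_tagged_line(line: str) -> List[Tuple[str, str]]:
    """Split a line of word_TAG items into (word, tag) pairs.

    ASSUMPTION: tokens are whitespace-separated and each token is written
    as word_TAG. This layout is not confirmed by the corpus description.
    """
    pairs = []
    for item in line.split():
        # Split on the LAST underscore so that words containing "_" survive.
        word, sep, tag = item.rpartition("_")
        if sep:  # items without an underscore are skipped
            pairs.append((word, tag))
    return pairs


# Constructed example using C7 tags (PPIS1, VV0, NN2), not a corpus sentence.
sample = "I_PPIS1 like_VV0 apples_NN2 ._."
for word, tag in parse_tagged_line(sample):
    print(word, tag)
```

If the files turn out to use a different layout (for example one token per line), only the splitting step would need to change.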

    The TECCL corpus: Its background and highlights

    The TECCL corpus contains approximately 10,000 writing samples by Chinese EFL learners, totalling 1,817,335 words. (Note: we count as words all alphanumeric strings, including hyphenated strings, as matched by the regular expression [a-zA-Z0-9-]+.) Initially, 10,127 texts were sampled from an online writing and scoring system. From these, 263 texts were removed by hand: blank texts, texts written in Chinese, translated English texts, and duplicated and/or plagiarised texts. The finalised version of the TECCL corpus therefore consists of 9,864 texts. When submitting their texts to the online system, all contributors agreed to share them for future academic use. The texts were further anonymised to minimise the risk of disclosing writers' identities. The sampling frame of the corpus was drawn up by Jiajin Xu, who also undertook all the text cleaning and POS tagging. Liangping Wu assisted with the text cleaning at the early stage of the project.
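
The word definition quoted above can be tried out directly. The sketch below counts tokens with the regular expression [a-zA-Z0-9-]+ from the corpus description; the function name and the sample sentence are illustrative assumptions, and this is not necessarily the script that produced the published word count.

```python
import re

# The word definition quoted in the corpus description:
# runs of letters, digits and hyphens.
TOKEN_PATTERN = re.compile(r"[a-zA-Z0-9-]+")


def count_tokens(text: str) -> int:
    """Return the number of word tokens in `text` under that definition."""
    return len(TOKEN_PATTERN.findall(text))


# Made-up sentence with missing spaces after punctuation.
sample = "My home-town is small,but I love it.It has 3 parks."
print(count_tokens(sample))  # 12
```

Because punctuation is never part of a token, "small,but" still counts as two words. Note that under this definition a contraction such as "don't" is split at the apostrophe and counts as two tokens.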

    The TECCL corpus stands out not for its size but for its representativeness in the following five respects.

    1) Unlike other available Chinese learner corpora, the TECCL corpus is up to date as of 2015. The material was produced between 2011 and 2015, and the corpus was compiled to mirror Chinese EFL learners' English of that period.

    2) The corpus features a wide range of topics or prompts: by a rough estimate, over 1,000 different essay topics.

    3) The writers range from elementary school pupils to postgraduate students, with undergraduates the overwhelming majority. The ratio of so-called 985/211 to non-985/211 universities largely corresponds to the actual proportion among Chinese universities.

    4) The geographical spread of the writers in the TECCL corpus is by far the widest of all corpora of Chinese EFL learners' English. The corpus encompasses texts from 32 provinces and (autonomous) regions, including Hong Kong and Taiwan.

    5) In stark contrast to other corpora of Chinese EFL learners' English, the TECCL corpus comprises both texts written in class or in testing contexts under (time) pressure and texts written after class. It even includes some collaborative writing samples. Most previous corpora of this kind consist of compositions produced in high-stakes standardised English tests, such as CET-4/6, TEM4/8 and PETS.

    A known problem with text typography

    Chinese learners have a notorious habit of typing words immediately after commas and full stops without a space. This spacing problem has not been corrected in the final version of the corpus. Fortunately, it does not affect the computation of word tokens or the tagging of parts of speech. Users of the corpus can add a white space after the punctuation marks if necessary.
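
Users who do want to add the missing spaces could start from something like the sketch below. It is only one possible approach, not part of the corpus distribution, and the function name is made up. It inserts a space after a comma or full stop that sits directly between two letters, which leaves numbers such as 1,817,335 or 3.5 untouched. A known limitation is that abbreviations and web addresses (e.g. "e.g." or "www.example.com") would also be split, so check the output before relying on it.

```python
import re

# A comma or full stop that sits between two letters, e.g. "small,but" or "it.It".
# Requiring a letter on both sides leaves numbers such as 1,817,335 and 3.5 alone.
MISSING_SPACE = re.compile(r"(?<=[A-Za-z])([,.])(?=[A-Za-z])")


def add_space_after_punctuation(text: str) -> str:
    """Insert a space after commas/full stops that are glued to the next word."""
    return MISSING_SPACE.sub(r"\1 ", text)


# Made-up sentence showing the spacing habit described above.
sample = "My town is small,but I love it.It has 3.5 million people."
print(add_space_after_punctuation(sample))
```

It is safest to run this on a copy of 01TECCL_RAW rather than on the original files, so that the untouched texts remain available.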


    The TECCL corpus can be downloaded for personal research, but may not be used for any commercial purpose.


    Please feel free to report any problems with the texts to bfsucrg@sina.com.

    Web-based concordancing of the TECCL corpus is available at BFSU CQPweb.

    More information about the corpus is available at the official site of the Corpus Research Group, National Research Centre for Foreign Language Education, Beijing Foreign Studies University, http://www.bfsu-corpus.org.








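      Corpus versions: The TECCL corpus is released in two formats, raw texts and POS-tagged texts, stored in the folders 01TECCL_RAW and 02TECCL_POS respectively. POS tagging was done with the CLAWS tagger using the C7 tag set (see http://ucrel.lancs.ac.uk/claws7tags.html).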

      Citation format: 薛熙哲 (Xue, Xizhe). 2015. 中国学生万篇英语作文语料库 (V1.1) (Ten-thousand English Compositions of Chinese Learners, Version 1.1; The TECCL corpus for short).















      The TECCL corpus is also deployed on BFSU CQPweb; you can visit http:// to search the TECCL corpus online.

  2. 清风出袖

    清风出袖 Senior Member

    thanks a lot for sharing this new corpus with us all!
  3. thanks for Dr. Xu's generosity
  4. Dr. Xu, thank you soooooo much for sharing the new corpus. I'm writing my dissertation on corpus-based lexical chunks. The new corpus has helped me a great deal. Btw, Happy New Year! Best wishes to you!
  5. kevin

    kevin Junior Member

    Dr. Xu, thank you so much. The new corpus attracts me a lot. I send my warmest wishes for a very happy new year.
  6. Many thanks for this very impressive project, Dr. Xu!
    Where can we find more detailed specifications of this up-to-date corpus? The feature description part of the documentation doesn't seem detailed enough.
  7. xujiajin

    xujiajin Administrator Staff Member

    The Excel file in the folder provides all the information about the writers and the writing tasks. A few days ago, a friend wrote to ask whether the texts could be separated into English majors and non-English majors. My answer was no, because I couldn't collect such data from the original database. Besides, I don't see any point in separating the two groups of students. In the current educational system of China, non-English majors at good universities significantly outperform English majors at less good universities, don't they?
  8. Thanks a lot! The Excel document is very helpful. (I had failed to notice it :p)
    Does your team have any plan to annotate the errors? (That's extremely painstaking and prone to mistakes. CLEC was in a sense not very successful.)
  9. xujiajin

    xujiajin Administrator Staff Member

    No. We will NEVER apply error codes to the TECCL corpus as corpus compilers. The following comments from Prof. Maocheng Liang might best account for our decision.

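    "I was fortunate to take part in ... the construction of SWECCL. ... During the construction of the corpus we received guidance from Prof. Gui Shichun, Prof. He Anping and Prof. Susan Hunston of the University of Birmingham, UK, and had the opportunity to consult Prof. Wei Naixing, Prof. Li Wenzhong and Prof. Pu Jianzhong.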


    Last edited: 2016-01-16
  10. Could Not Connect
    Description: Could not connect to the requested server host.


  11. xujiajin

    xujiajin Administrator Staff Member

    Will contact the IT people to restore the server soon. You can try our online concordancing portal instead at . Both user id and pass are 'test'.
  12. Many thanks~
  13. xujiajin

    xujiajin Administrator Staff Member

    The link is live now.
  14. When I choose "Keywords" in the menu and try to calculate keywords, the page shows: "
    CQPweb encountered an error and could not continue.

    The two frequency lists you have chosen are identical!

    ... in file /usr/local/apache2/htdocs/cqp/lib/keywords.inc.php line 239."

    What does it mean? Did I do something wrong?
  15. Professor Xu, many thanks for sharing TECCL. I want to distinguish college writing from high school writing, but I can't find any clue in the file names. Would you please tell me the secret? Thank you.
  16. Hi, you could try the "Restricted Query" menu item.
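  17. Dr. Xu: the learners in TECCL basically fall into three groups: secondary school students, university students, and "other". Does "other" refer to primary school students? Also, within the university group, is there any distinction between first-, second-, third- and fourth-year students, or postgraduates? I can't find this information in the Excel sheet.
  18. "Other" does not refer to primary school students; I'm also wondering what "other" means. There are 58 texts by primary school students, as the Excel document states, but two of those 58 are actually duplicates.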