Freshly released in 2015: a corpus of Chinese learners' English, the TECCL corpus

Posted by xujiajin on 2015-12-28 in the "Learner Corpora and Second Language Acquisition" forum

  1. xujiajin

    xujiajin Administrator Staff Member

    Ten-thousand English Compositions of Chinese Learners (the TECCL Corpus) Version 1.1

    2015-12-28

    Download the TECCL corpus here!

    Key information of the TECCL Corpus

    Corpus name: Ten-thousand English Compositions of Chinese Learners (the TECCL corpus) (Version 1.1)

    Text contributors: Xue, Xizhe (Romanised pinyin notation of the Chinese word "learner")

    Project initiator: Jiajin Xu (the National Research Centre for Foreign Language Education, Beijing Foreign Studies University)

    Year of corpus creation: 2015

    Formats of the corpus: The TECCL corpus is available in two forms, i.e. raw texts and part-of-speech (POS) tagged texts. They are stored in two folders, 01TECCL_RAW and 02TECCL_POS. The POS-tagged texts were annotated with the CLAWS POS tagger developed at UCREL, Lancaster University, UK, using tagset version 7 (C7; cf. http://ucrel.lancs.ac.uk/claws7tags.html).

    Citation: Xue, Xizhe. 2015. Ten-thousand English Compositions of Chinese Learners (The TECCL corpus), Version 1.1. The National Research Centre for Foreign Language Education, Beijing Foreign Studies University.


    The TECCL corpus: Its background and highlights

    The TECCL corpus contains approximately 10,000 writing samples by Chinese EFL learners, totalling 1,817,335 words (Note: We count as words all alphanumeric strings, including hyphenated strings, as matched by the regular expression [a-zA-Z0-9-]+). Initially, 10,127 texts were sampled from an online writing and scoring system. Of these, 263 texts (blank texts, texts written in Chinese, translated English texts, and duplicated and/or plagiarised texts) were removed by hand. As a result, the finalised version of the TECCL corpus consists of 9,864 texts. All the text contributors agreed, when submitting their texts to the online system, to share them for future academic use. Further anonymisation was carried out to keep the risk of disclosing writers' identities to a minimum. The sampling frame of the corpus was drawn up by Jiajin Xu, who also undertook all the text cleaning and POS tagging. Liangping Wu assisted with the text cleaning at the early stage of the project.
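The word definition quoted above can be applied to any raw text with a few lines of Python. This is only a minimal sketch of that definition, not the script used to produce the published word count; the function name `count_words` and the sample sentence are made up for illustration.

```python
import re

# Word definition quoted in the corpus documentation: alphanumeric
# strings, including hyphenated strings.
WORD_RE = re.compile(r"[a-zA-Z0-9-]+")


def count_words(text: str) -> int:
    """Count the tokens that match the documented word definition."""
    return len(WORD_RE.findall(text))


# Hypothetical sample sentence; note the missing space after the comma.
sample = "My part-time job in 2015 was fun,but tiring."
print(count_words(sample))
```

Because punctuation never matches the pattern, "fun,but" and "fun, but" both count as two words. A stand-alone hyphen surrounded by spaces would also be counted as a token under this pattern.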

    The TECCL corpus stands out not for its size but for its representativeness in the following five respects.

    1) As of 2015, the TECCL corpus is more up to date than the other Chinese learner corpora available. The material included was produced between 2011 and 2015. The corpus was compiled to mirror the Chinese EFL learners' English of the time.

    2) The corpus features a wide range of topics or prompts. A rough estimate puts the number at over 1,000 different essay topics.

    3) The writers in the corpus run the gamut from elementary school pupils to postgraduate students, with undergraduates being the overwhelming majority. The ratio of so-called 985/211 universities to non-985/211 universities in the corpus largely corresponds to their actual proportion among Chinese universities.

    4) The geographical spread of the writers in the TECCL corpus is by far the widest of all corpora of Chinese EFL learners' English. The corpus encompasses text material from 32 provinces and (autonomous) regions, including Hong Kong and Taiwan.

    5) In stark contrast to other corpora of Chinese EFL learners' English, the TECCL corpus comprises texts written in class or in test settings under (time) pressure as well as texts written after class. The corpus even takes in some collaborative writing samples. Most previous corpora of Chinese EFL learners' English consist of compositions produced in high-stakes standardised English tests, such as CET-4/6, TEM4/8 and PETS.


    A known problem with text typography

    Chinese learners have a notorious habit of typing words immediately after commas and full stops without a space. This spacing problem has not been corrected in the final version of the corpus. Fortunately, it affects neither the computation of word tokens nor the tagging of parts of speech. Users of the corpus can add a white space after the punctuation marks, if necessary.
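For users who do want to restore the spacing, a regular-expression substitution is one possible approach. The sketch below is an assumption about how this could be done, not a procedure supplied with the corpus; the function name `add_space_after_punctuation` and the sample sentence are hypothetical.

```python
import re

# A comma or full stop that is immediately followed by a letter.
# Requiring a letter leaves figures such as 3.5 or 1,000 untouched.
MISSING_SPACE_RE = re.compile(r"([,.])(?=[A-Za-z])")


def add_space_after_punctuation(text: str) -> str:
    """Insert one space after a comma or full stop that lacks it."""
    return MISSING_SPACE_RE.sub(r"\1 ", text)


# Hypothetical sample sentence with two missing spaces.
sample = "I like apples,pears and tea.They cost 3.5 yuan."
print(add_space_after_punctuation(sample))
```

The pattern is deliberately simple: it would also split abbreviations such as "e.g." and web addresses, so it is worth checking the output on a few texts before applying it to the whole corpus.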


    Disclaimer

    The TECCL corpus can be downloaded for personal research, but must not be used for any commercial purpose.


    Contact

    Please feel free to report any problems with the texts to bfsucrg@sina.com.

    Web-based concordancing of the TECCL corpus is enabled at BFSU CQPweb, http://111.200.194.212/cqp/.

    More information about the corpus is available at the official site of the Corpus Research Group, National Research Centre for Foreign Language Education, Beijing Foreign Studies University, http://www.bfsu-corpus.org.


    Documentation for the "Ten-thousand English Compositions of Chinese Learners" corpus (V1.1)

    (2015-12-28)


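    Key information of the TECCL corpus

      Chinese name: 中国学生万篇英语作文语料库 (V1.1)

      Text contributors: 薛熙哲 (Xue Xizhe)

      Planning and compilation: 许家金 (Jiajin Xu)

      Year of creation: 2015

      Corpus versions: The TECCL corpus is released in two formats, raw texts and POS-tagged texts, stored in the two folders 01TECCL_RAW and 02TECCL_POS respectively. POS tagging was done with the CLAWS tagger, using the C7 tagset (see http://ucrel.lancs.ac.uk/claws7tags.html).

      Citation format: 薛熙哲 (Xue, Xizhe). 2015. 中国学生万篇英语作文语料库 (V1.1) (Ten-thousand English Compositions of Chinese Learners, Version 1.1, abbreviated as The TECCL corpus).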


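    The TECCL corpus: Background and features

      The TECCL corpus contains about 10,000 compositions, totalling 1,817,335 words (note: a word is defined as [a-zA-Z0-9-]+). At the start of collection there were 10,127 compositions; after blank documents, documents in Chinese, translation assignments, duplicate compositions, and texts clearly beyond learner level were deleted, 9,864 remained. All the texts come from an online composition marking system. All compositions included in TECCL have been authorised by their original authors. Further anonymisation was carried out when the corpus was built. The text collection plan was drawn up by Jiajin Xu, who also carried out the subsequent cleaning, processing and annotation, with the assistance of Liangping Wu.


      The TECCL corpus is not large, but its sampling is fairly representative. Its features can be summarised as follows:

      1) Recent texts. All the texts were produced between 2011 and 2015.

      2) Many topics. By a rough count, the TECCL corpus covers more than a thousand different composition topics.

      3) Wide range of school levels. The compositions cover university, secondary school and primary school, with university texts being the most numerous. The ratio of 985/211 to non-985/211 universities in the corpus is close to the actual make-up of universities in China.

      4) Wide geographical coverage. The texts come from 32 provinces, municipalities, autonomous regions and special administrative regions, including Hong Kong and Taiwan.

      5) Varied tasks. The writing task types include timed in-class compositions, after-class homework, mid-term and final examination compositions, scripts prepared for in-class speeches, and group collaborative compositions. These are coursework tasks within the English curriculum rather than compositions from high-stakes standardised tests. In this respect the TECCL corpus differs clearly from the learner English corpora previously built in China (e.g. the corpora of essays and oral tests from the College English Test (CET) Bands 4 and 6, the corpora of essays and oral tests from the Test for English Majors (TEM) Bands 4 and 8, and the Public English Test System (PETS) corpus).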


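    Note

      Missing spaces after punctuation marks are conspicuous in the corpus texts. This was not corrected when the corpus was released. It reflects the fact that users of English in China are not sensitive to spacing between words. The missing spaces do not affect word counts, nor do they interfere with POS tagging. If necessary, users of the corpus can add the spaces themselves by find and replace.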


    Disclaimer

      This corpus may be used for academic research only and must not be used for any form of commercial activity.


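    Contact

      We have done our best to clean up unsuitable material in the texts. If you find other problems, please contact: bfsucrg@sina.com.

      The TECCL corpus is also deployed on BFSU CQPweb; you can visit http://111.200.194.212/cqp/ to search the TECCL corpus online.

      For more information about the corpus, visit the website of the corpus linguistics team at Beijing Foreign Studies University: http://www.bfsu-corpus.org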
     
  2. 清风出袖

    清风出袖 Senior Member

    thanks a lot for sharing this new corpus with us all!
     
  3. thanks for Dr. Xu's generosity
     
  4. Dr. Xu, thank you soooooo much for sharing the new corpus. I'm writing my dissertation on corpus-based lexical chunks. The new corpus has helped me a great deal. Btw, Happy New Year! Best wishes to you!
     
  5. kevin

    kevin Junior Member

    Dr. Xu, thank you so much. The new corpus attracts me a lot. I send my warmest wishes for a very happy new year.
     
  6. Many thanks for this very impressive project, Dr. Xu!
    Where can we find more detailed specifications of this up-to-date corpus? The feature description part of the documentation doesn't seem detailed enough.
     
  7. xujiajin

    xujiajin Administrator Staff Member

    The Excel file in the folder provides all the information about the writers and the writing tasks. A few days ago, a friend wrote to ask whether the texts could be separated into English majors and non-English majors. My reply was negative, because I couldn't collect such data from the original database. Besides, I don't see any point in separating the groups of students. In the current educational system of China, non-English major students at good universities significantly outperform English majors at less good universities, don't they?
     
  8. Thanks a lot! The Excel document is very helpful. (I had failed to notice it :p)
    Does your team have any plan to annotate the errors? (That's extremely painstaking and prone to mistakes. CLEC was in a sense not very successful.)
     
  9. xujiajin

    xujiajin Administrator Staff Member

    No. We will NEVER apply error codes to the TECCL corpus as corpus compilers. The following comments from Prof. Maocheng Liang might best account for our decision.

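    "I was fortunate to take part in building ... SWECCL. ... While building the corpus we received guidance from Professor Gui Shichun, Professor He Anping and Professor Susan Hunston of the University of Birmingham, UK, and had the opportunity to consult Professor Wei Naixing, Professor Li Wenzhong and Professor Pu Jianzhong.

    ... Drawing on his own experience of creating ... CLEC, Professor Gui Shichun pointed out that error annotation is not only time-consuming and laborious, but also that different annotators can hardly agree on what counts as an error. Professor Hunston was even more opposed to error annotation, holding that we should keep the texts as they are, and the other professors voiced the same or similar views. Looking back, it is fortunate that we sought the opinions of these experts at the time; otherwise we would have sunk into the quagmire of error annotation. My current view is that ... annotation, and error annotation in particular, should be left to researchers themselves. After all, because research purposes differ, different researchers' understanding and classification of errors can diverge widely, and corpus builders cannot design an annotation scheme that satisfies every research purpose. In short, annotating texts calls for great caution and full consideration of the research purpose."

    The passage above is quoted from the journal 《语料库语言学》 (Corpus Linguistics), 2015, Issue 2. I will upload the full paper later.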
     
    Last edited: 2016-01-16
  10. Could Not Connect
    Description: Could not connect to the requested server host.

    This is what appears when I click the download link. How can I fix it?

     
  11. xujiajin

    xujiajin Administrator Staff Member

    Will contact the IT people to restore the server soon. You can try our online concordancing portal instead at http://111.200.194.212/cqp/teccl/ . Both user id and pass are 'test'.
     
  12. Many thanks~
     
  13. xujiajin

    xujiajin Administrator Staff Member

    The link is live now.
    http://www.bfsu-corpus.org/content/teccl-corpus
     
  14. When I choose "Keywords" in the menu to calculate keywords, the page shows: "
    CQPweb encountered an error and could not continue.

    The two frequency lists you have chosen are identical!

    ... in file /usr/local/apache2/htdocs/cqp/lib/keywords.inc.php line 239."

    What does it mean? I was wondering if I did sth. wrong?
     
  15. Professor Xu, many thanks for sharing TECCL. I want to distinguish college writing from high school writing, but I can't find any cue in the file names. Would you please tell me the secret? Thank you.
     
  16. Hi, you could try the "Restricted Query" menu item.