Among translational corpora, the best known is the Translational English Corpus (TEC), the world's first translational corpus, created in 1995 by the Centre for Translation Studies at the University of Manchester Institute of Science and Technology (UMIST), UK. It mainly collects texts translated into English from a range of languages and currently holds over ten million words (the target is 50 million words), divided into four subcorpora: fiction (about 80%), biography, newspapers, and periodicals. It does not require the texts to be bilingually aligned.
The corpus is not only annotated with tags but also carries a good deal of extralinguistic information: details of the translator (name, gender, nationality, occupation, direction of translation, etc.), mode of translation, type of translation, source language, details of the original book, publisher, and so on are each annotated.
The URL is http://www.fleric.org.cn/ceo/
-- The Babel English-Chinese Parallel Corpus
The Babel English-Chinese Parallel Corpus, which was created on our research project Contrasting English and Chinese (ESRC Award Reference RES-000-23-0553), consists of 327 English articles and their translations in Mandarin Chinese. Of these, 115 texts (121,493 English tokens plus 135,493 Chinese tokens) were collected from the World of English between October 2000 and February 2001, while the remaining 212 texts (132,140 English tokens plus 151,969 Chinese tokens) were collected from Time from September 2000 to January 2001. The corpus contains a total of 544,095 words (253,633 English words and 287,462 Chinese tokens).
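A quick arithmetic check of the figures quoted above, as a minimal Python sketch (the dictionary layout and the names `counts` and `total` are mine, not part of the corpus documentation). The per-language totals match the quoted 253,633 and 287,462, but together they come to 541,095 rather than the quoted 544,095, so one of the quoted figures presumably contains a typo; the source does not indicate which.

```python
# Token counts as quoted in the description above: (source, language) -> tokens.
counts = {
    ("World of English", "en"): 121_493,
    ("World of English", "zh"): 135_493,
    ("Time", "en"): 132_140,
    ("Time", "zh"): 151_969,
}


def total(lang=None):
    """Sum token counts, optionally restricted to one language code."""
    return sum(n for (_, code), n in counts.items() if lang is None or code == lang)


english_total = total("en")
chinese_total = total("zh")
grand_total = total()

print(english_total, chinese_total, grand_total)
```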
The corpus is tagged for part of speech and aligned at the sentence level. The English texts were tagged using the CLAWS C7 tagset, while the Chinese texts were tagged using the Peking University tagset. Sentence alignment was done automatically and corrected by hand. The corpus is also marked up for paragraphs and sentences, but different markup schemes were adopted for the two subcorpora: in the World of English component, sentences are marked consecutively throughout, whereas in the Time component, sentences are marked within each paragraph.
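To make the difference between the two sentence-marking schemes concrete, here is a small Python sketch. It is only an illustration: the description above does not give the actual markup element or attribute names, so the function names and the toy text below are my own assumptions.

```python
def number_consecutively(paragraphs):
    """Number sentences 1..N across the whole text (World of English style)."""
    ids, n = [], 0
    for p_idx, sentences in enumerate(paragraphs, start=1):
        for sentence in sentences:
            n += 1
            ids.append((p_idx, n, sentence))
    return ids


def number_within_paragraph(paragraphs):
    """Restart sentence numbering at 1 in every paragraph (Time style)."""
    return [
        (p_idx, s_idx, sentence)
        for p_idx, sentences in enumerate(paragraphs, start=1)
        for s_idx, sentence in enumerate(sentences, start=1)
    ]


# Toy text: three paragraphs, each a list of sentences.
text = [["A.", "B."], ["C."], ["D.", "E."]]

consecutive = [(p, s) for p, s, _ in number_consecutively(text)]
per_paragraph = [(p, s) for p, s, _ in number_within_paragraph(text)]
print(consecutive)
print(per_paragraph)
```

In practice this means a sentence number alone identifies a sentence in the World of English texts, while in the Time texts you need the paragraph number as well.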
The Babel parallel corpus can be accessed via the ParaConc Web or MySQL interface (both hosted at The Institute of Education, Singapore). Users can search in either the English or the Chinese texts. The concordancer returns whole matched sentences and their translations, as well as their locations. At the bottom of the resulting concordance page is a query report that indicates the query strings and the distribution of matches. Users can also specify the format of the output concordances as POS-tagged or plain text.
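For readers who have not used a parallel concordancer, the behaviour described above can be sketched roughly as follows. This is not the ParaConc Web or MySQL interface itself: the aligned sentences, the location strings, and the `concordance` function are all invented for illustration, and matching here is plain substring search.

```python
from collections import Counter

# Hypothetical aligned data: (location, English sentence, Chinese sentence).
aligned = [
    ("time/p1/s1", "The corpus is small.", "这个语料库很小。"),
    ("time/p1/s2", "The search is fast.", "检索很快。"),
    ("woe/s1", "A corpus helps translators.", "语料库对译者有帮助。"),
]


def concordance(query, side="en"):
    """Return aligned pairs whose chosen side contains `query`, plus a small report."""
    col = 1 if side == "en" else 2
    hits = [row for row in aligned if query.lower() in row[col].lower()]
    # Distribution of matches by source, taken from the first part of the location.
    distribution = Counter(loc.split("/")[0] for loc, _, _ in hits)
    report = {
        "query": query,
        "side": side,
        "matches": len(hits),
        "distribution": dict(distribution),
    }
    return hits, report


hits, report = concordance("corpus")
for loc, en, zh in hits:
    print(loc, en, zh, sep="\t")
print(report)
```

Searching the Chinese side works the same way, for example `concordance("检索", side="zh")`.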
--The Institute of Language Engineering at Shanghai Jiao Tong University currently offers four corpora (JDEST, LOB, BROWN, and CLEC), totalling 7 million words, for online searching; search results and statistics can also be downloaded.
--The Translational English Corpus (TEC)
--English Chinese Parallel Concordancer (E-C Concord)
The Hong Kong Institute of Education.
Project leader: Dr. Wang Lixun. Program designers: Chris Greaves, Wang Lixun
--Academia Sinica Balanced Corpus of Modern Chinese (中央研究院现代汉语平衡语料库)
http://www.sinica.edu.tw/SinicaCorpus/
--Lancaster Corpus of Mandarin Chinese
may be changed to
--People's Daily 2000 corpus
Some related information is available here: http://www.lancs.ac.uk/fass/projects/corpus/pdc2000/default.htm
--A Parallel Corpus of Chinese Legal Texts (中國法律文件漢英平行語料庫)