Babel Parallel E-C corpus

xiaoz

Permanent Super Administrator
Staff member
The Babel English-Chinese parallel corpus is now freely available at

http://bowland-files.lancs.ac.uk/corplang/babel/babel.htm

which allows online parallel concordancing.
 
Re: Babel Parallel E-C corpus

The search results show file names to be something like

1 utfbifile0.txt_0004
2 utfbifile0.txt_0015
3 utfbifile0.txt_0100
4 utfbifile0.txt_0134
141 utfbifile1.txt_48_4_3
142 utfbifile2.txt_1_1_2

Obviously there is a text taxonomy used to categorize the 327
English articles there. What are the categories, how many of them
are there, and what do they roughly mean?

Any information would be greatly appreciated.
 
Only the titles of the 327 English articles and their Chinese translations are listed, not the text categories.
http://bowland-files.lancs.ac.uk/corplang/babel/titles.htm
 
Judging from the filenames, one possibility is that there is no text taxonomy at all, just consecutive numbering of the 327 articles.
 
Re: Babel Parallel E-C corpus

So, do you imply that this parallel corpus is not representative?
 
Re: Babel Parallel E-C corpus

Those labels are unique sentence identifiers for easy reference. The leading numeral is the hit number, followed by the filename and the sentence number within that file.

utfbifile0 is a large file including all articles from the World of English, while the other 4 files include articles from Times.

No taxonomy was used. All files are grouped according to their origins.
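
For anyone post-processing the concordance output, here is a minimal Python sketch of splitting such a label into its parts. The names `Hit` and `parse_label` are made up for illustration and are not part of the concordancer, and the sketch assumes every label has the shape shown in the examples earlier in this thread (hit number, a space, a filename ending in .txt, then underscore-separated numbers).

```python
import re
from typing import NamedTuple, Tuple


class Hit(NamedTuple):
    """One concordance label, e.g. '141 utfbifile1.txt_48_4_3' (hypothetical type)."""
    hit_number: int            # position of the hit in the result list
    filename: str              # source file, e.g. 'utfbifile1.txt'
    locator: Tuple[int, ...]   # sentence locator within the file


# Assumed label shape: <hit> <name>.txt_<n>[_<n>...]
_LABEL = re.compile(r"^\s*(\d+)\s+(\S+?\.txt)_(\d+(?:_\d+)*)\s*$")


def parse_label(line: str) -> Hit:
    """Split a result label into hit number, filename and sentence locator."""
    m = _LABEL.match(line)
    if m is None:
        raise ValueError(f"unrecognised label: {line!r}")
    hit, filename, locator = m.groups()
    # Leading zeros ('0004') are dropped when converting to int.
    return Hit(int(hit), filename, tuple(int(p) for p in locator.split("_")))


if __name__ == "__main__":
    for label in ("1 utfbifile0.txt_0004", "141 utfbifile1.txt_48_4_3"):
        print(parse_label(label))
```

The locator is kept as a plain tuple of integers because the thread does not say what the individual parts of a multi-part number such as 48_4_3 stand for.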


Quoting 动态语法 (2005-7-29 13:58:11):
The search results show file names to be something like

1 utfbifile0.txt_0004
2 utfbifile0.txt_0015
3 utfbifile0.txt_0100
4 utfbifile0.txt_0134
141 utfbifile1.txt_48_4_3
142 utfbifile2.txt_1_1_2

Obviously there is a text taxonomy used to categorize the 327
English articles there. What are the categories, how many of them
are there, and what do they roughly mean?

Any information would be greatly appreciated.
 
Re: Babel Parallel E-C corpus

No taxonomy. The leading numbers indicate the hit number; the trailing numbers are sentence numbers within a file.


Quoting xujiajin (2005-7-29 16:44:35):
Judging from the filenames, one possibility is that there is no text taxonomy at all, just consecutive numbering of the 327 articles.
 
Re: Babel Parallel E-C corpus

I do not claim that this corpus is representative, but it is certainly more balanced than many other E-C parallel corpora which include only news texts or literary works, because of its wide coverage of topics and domains.

See K. Wang for a better-balanced, more representative parallel corpus.


Quoting oscar3 (2005-7-29 16:50:38):
So, do you imply that this parallel corpus is not representative?
 
Re: Babel Parallel E-C corpus

Quoting xiaoz (2005-7-29 20:56:07):
I do not claim that this corpus is representative, but it is certainly more balanced than many other E-C parallel corpora which include only news texts or literary works, because of its wide coverage of topics and domains.

See K. Wang for a better-balanced, more representative parallel corpus.


Quoting oscar3 (2005-7-29 16:50:38):
So, do you imply that this parallel corpus is not representative?

It looks quite representative to me, at least for the written genres. That's
part of the reason that I thought there was a taxonomy used in classifying the
texts there. Thanks for the clarification.

This is a very well-constructed corpus. I hope more people will take advantage of this
great resource.
 