Quoting 动态语法, posted 2005-7-29 13:58:11:
The search results show file names like the following:
1 utfbifile0.txt_0004
2 utfbifile0.txt_0015
3 utfbifile0.txt_0100
4 utfbifile0.txt_0134
141 utfbifile1.txt_48_4_3
142 utfbifile2.txt_1_1_2
Obviously there is a text taxonomy used to categorize the 327
English articles there. What are the categories, how many of them
are there, and what do they roughly mean?
Any information would be greatly appreciated.
Quoting xujiajin, posted 2005-7-29 16:44:35:
Judging from the filenames, one possibility is that there is no taxonomy of the texts at all, and the numbers simply reflect the consecutive numbering of the 327 articles.
Quoting oscar3, posted 2005-7-29 16:50:38:
So, do you imply that this parallel corpus is not representative?
Quoting xiaoz, posted 2005-7-29 20:56:07:
I do not claim that this corpus is representative, but it is certainly more balanced than many other E-C parallel corpora which include only news texts or literary works, because of its wide coverage of topics and domains.
See K. Wang for a better-balanced, more representative parallel corpus.