6-language parallel corpus

xiaoz

Permanent super administrator
Staff member
A new corpus has just been made available at the Machine Translation
Summit XII conference. Some of you might be interested in it as well.
The corpus and the related paper are now available from: http://www.uncorpora.org
Some basic stats:
*) 6 languages, perfectly aligned on paragraph level: Arabic, Chinese,
English, French, Russian, Spanish
*) ~74,000 paragraphs (× 6 languages)
*) ~3M tokens per language
*) Derived from the resolutions of the General Assembly of the United Nations.
*) The corpus is released in TMX (Translation Memory eXchange) format,
ready for processing with open-source tools like Olifant or with
commercial tools like Trados.
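For anyone who would rather script against the TMX file than open it in a translation-memory tool, here is a minimal Python sketch. It assumes the usual TMX layout (each <tu> holds one <tuv> per language, with the text in <seg>) and that the language is given in an xml:lang attribute or an older lang attribute. The function name, the lowercasing of language codes, and the sample text are my own illustration, not taken from the corpus or its paper, so check the codes actually used in the released files.

```python
import io
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def iter_translation_units(source):
    """Yield one {language: text} dict per <tu> element in a TMX document.

    `source` is a filename or a binary file object. The file is streamed,
    so the whole corpus never has to fit in memory at once.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "tu":
            continue
        unit = {}
        for tuv in elem.iter("tuv"):
            # Older TMX versions use a plain "lang" attribute instead.
            lang = tuv.get(XML_LANG) or tuv.get("lang")
            seg = tuv.find("seg")
            if lang and seg is not None:
                # itertext() also collects text inside inline markup tags.
                unit[lang.lower()] = "".join(seg.itertext()).strip()
        yield unit
        elem.clear()  # free the finished <tu> before reading the next one


# Made-up two-language sample, for illustration only (not corpus data).
SAMPLE_TMX = """<?xml version="1.0" encoding="UTF-8"?>
<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="paragraph"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="EN"><seg>The General Assembly</seg></tuv>
      <tuv xml:lang="FR"><seg>L'Assemblée générale</seg></tuv>
    </tu>
  </body>
</tmx>
"""

if __name__ == "__main__":
    stream = io.BytesIO(SAMPLE_TMX.encode("utf-8"))
    for unit in iter_translation_units(stream):
        print(unit)
```

To run it on the real corpus, pass the path of the downloaded TMX file to iter_translation_units instead of the in-memory sample.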
With about 3 million tokens per language, the corpus is somewhat small to
serve as a primary corpus for Machine Translation research, but it could be
useful as a supplementary one, especially for less-resourced languages
like Arabic, Chinese, and Russian.
It is also suitable for terminology extraction, named entity
recognition, graph-based analysis techniques, and other approaches
that are of interest for a restricted-domain corpus.
 
Re: 6-language parallel corpus

Thanks to xiaoz for introducing these good resources, but there is still a long way to go before this becomes a functionally accessible web-based multilingual corpus.
 