JRC-Acquis: a large aligned parallel corpus in 21 languages, freely
available
Readers on this list may be interested in the availability of the
'JRC-Acquis' parallel corpus:
SIZE AND FORMAT
- 21 languages (all 20 official EU languages plus Romanian)
- Average corpus size: 8.8 million words per language
- XML format following TEI P4, UTF-8 encoded
- Modular: download the languages you need.
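As a rough indication of overall size, the figures above can be multiplied out. This is only a back-of-the-envelope sketch: it assumes the stated average of 8.8 million words holds across all 21 languages, and the variable names are illustrative.

```python
# Figures taken from the announcement; the total is an estimate only,
# since 8.8 million is a rounded per-language average.
NUM_LANGUAGES = 21
AVG_WORDS_PER_LANGUAGE = 8_800_000

total_words = NUM_LANGUAGES * AVG_WORDS_PER_LANGUAGE
print(f"Approximate total: {total_words:,} words")
```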
LANGUAGES
Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French,
Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,
Romanian, Slovak, Slovene, Spanish, Swedish.
TEXT TYPES
- Documents on contents, principles and political objectives of the EU
Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements.
PARAGRAPH ALIGNMENT
- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- Two alternative alignments, produced with Vanilla and HunAlign
- Ca. 270,000 alignments per language pair.
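The figure of 210 language pairs follows from choosing 2 languages out of 21 without regard to order. A minimal sketch that enumerates the pairs from the language list in this announcement (the variable names are illustrative, and the alignment total is only an estimate based on the approximate per-pair figure):

```python
from itertools import combinations

# The 21 corpus languages, as listed in this announcement.
LANGUAGES = [
    "Czech", "Danish", "Dutch", "English", "Estonian", "German", "Greek",
    "Finnish", "French", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Maltese", "Polish", "Portuguese", "Romanian", "Slovak", "Slovene",
    "Spanish", "Swedish",
]

# Unordered pairs: n * (n - 1) / 2 combinations for n languages.
pairs = list(combinations(LANGUAGES, 2))

# Rough total, assuming about 270,000 alignments for every pair.
APPROX_ALIGNMENTS_PER_PAIR = 270_000
approx_total_alignments = len(pairs) * APPROX_ALIGNMENTS_PER_PAIR

print(f"{len(LANGUAGES)} languages, {len(pairs)} language pairs")
print(f"Roughly {approx_total_alignments:,} alignments in total")
```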
MANUAL SUBJECT DOMAIN CLASSIFICATION
- Manually classified according to EUROVOC subject domains
- Classes drawn from a wide-coverage hierarchy of 6000 classes.
USE / DOWNLOAD
- Download from http://langtech.jrc.it/JRC-Acquis.html
- Usage free for research purposes.
FOR MORE DETAILS
Ralf Steinberger, Bruno Pouliquen, Anna Widiger, Camelia Ignat, Tomaž
Erjavec, Dan Tufiş, Dániel Varga (2006). 'The JRC-Acquis: A multilingual
aligned parallel corpus with 20+ languages'. Proceedings of the 5th
International Conference on Language Resources and Evaluation (LREC'2006).
Genoa, Italy, 24-26 May 2006. Available at
http://langtech.jrc.it/#Publications.
CONTACT FOR FURTHER INFORMATION
Ralf Steinberger (Ralf.Steinberger@jrc.it)
European Commission - Joint Research Centre (JRC)
IPSC - SeS - Language Technology
URL: http://langtech.jrc.it, http://press.jrc.it/NewsExplorer
T.P. 267, Via Fermi 1
21020 Ispra (VA), Italy
Tel: +39 0332 78-6271
Fax: +39 0332 78-5154