1. JRC-Acquis: a large aligned parallel corpus in 21 languages, freely
available
SIZE AND FORMAT
- 21 languages (all 20 official EU languages plus Romanian)
- Average corpus size: 8.8 million words per language
- XML Format according to TEI P4, UTF-8-encoded
- Modular: download the languages you need.
LANGUAGES
Czech, Danish, Dutch, English, Estonian, German, Greek, Finnish, French,
Hungarian, Italian, Latvian, Lithuanian, Maltese, Polish, Portuguese,
Romanian, Slovak, Slovene, Spanish, Swedish.
TEXT TYPES
- Documents on contents, principles and political objectives of the EU
  Treaties
- EU legislation
- Declarations
- Resolutions
- Acts
- International agreements.
PARAGRAPH ALIGNMENT
- Paragraph-aligned for all 210 language pairs
- Paragraphs are sentence parts, sentences, or groups of sentences
- 2 alternative alignments: using Vanilla and HunAlign
- Ca. 270,000 alignments per language pair.
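
The figure of 210 language pairs follows from choosing 2 languages out of 21 without regard to order (21 x 20 / 2 = 210). A minimal sketch in Python that enumerates the pairs from the language list above; the names LANGUAGES and language_pairs are illustrative only and are not part of the corpus distribution:

```python
from itertools import combinations

# The 21 corpus languages, as listed under LANGUAGES above.
LANGUAGES = [
    "Czech", "Danish", "Dutch", "English", "Estonian", "German", "Greek",
    "Finnish", "French", "Hungarian", "Italian", "Latvian", "Lithuanian",
    "Maltese", "Polish", "Portuguese", "Romanian", "Slovak", "Slovene",
    "Spanish", "Swedish",
]


def language_pairs(languages):
    """Return every unordered pair of distinct languages."""
    return list(combinations(languages, 2))


pairs = language_pairs(LANGUAGES)
print(len(LANGUAGES), "languages give", len(pairs), "unordered pairs")
```

Each pair appears once, so (Czech, Danish) is listed but (Danish, Czech) is not.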
MANUAL SUBJECT DOMAIN CLASSIFICATION
- Manually classified according to EUROVOC subject domains
- Classes are selected from the 6000 hierarchically organised,
  wide-coverage EUROVOC classes.
USE / DOWNLOAD
- Download from http://langtech.jrc.it/JRC-Acquis.html
- Usage free for research purposes.
2. HOLJ Corpus: built in the framework of the SUM project in
Edinburgh (http://www.ltg.ed.ac.uk/SUM/index.html).
It contains annotated court decisions by the House of Lords and can
be downloaded for free.