German-English Parallel Corpus


Bilingual Formal / Informal Address Corpus

This page provides the parallel German-English text corpus used in Faruqui and Padó (2012). It consists of 106 public-domain novels and stories, mostly 19th-century texts. The texts are segmented into paragraphs, sentences, and words, aligned at the sentence level, POS-tagged, and lemmatized.
Corpus sources and licensing

The texts are taken from Project Gutenberg for English and Projekt Gutenberg-DE for German. The English texts may be used and redistributed freely. The German texts are provided free of charge by Projekt Gutenberg-DE for personal use (which we assume to include academic fair use).

List of novels, authors, and original languages
Training set (74 novels, 57M)
Development set (19 novels, 17M)
Test set (13 novels, 13M)

Tools used to construct the corpus

TreeTagger: POS tagging and lemmatization for English and German
Gargantua: Unsupervised sentence alignment
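
For readers who want to process the annotations, here is a minimal Python sketch of reading tagger output. It assumes TreeTagger's usual three-column, tab-separated layout (word, POS tag, lemma), one token per line. This page does not describe the file layout of the corpus itself, so treat that layout as an assumption and adjust the parser to the files you actually download. The names Token and parse_tagged_lines are hypothetical, and the sample lines are made up for illustration rather than taken from the corpus.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    """One annotated token: surface form, POS tag, lemma."""
    word: str
    pos: str
    lemma: str


def parse_tagged_lines(lines: Iterable[str]) -> List[Token]:
    """Parse tab-separated word/POS/lemma lines into Token records.

    Assumption: each token line has exactly three tab-separated fields.
    Blank lines and lines with any other number of fields (for example
    markup lines) are skipped rather than treated as errors.
    """
    tokens = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            continue
        tokens.append(Token(*parts))
    return tokens


# Illustrative input only, not taken from the corpus.
sample = [
    "Haus\tNN\tHaus\n",
    "ist\tVAFIN\tsein\n",
    "alt\tADJD\talt\n",
]

for tok in parse_tagged_lines(sample):
    print(tok.word, tok.pos, tok.lemma)
```

Skipping malformed lines keeps the sketch tolerant of markup or empty separators between sentences; a stricter reader could raise an error instead.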


Feedback is always welcome at