http://www.nlpado.de/~sebastian/data/tv_data.shtml
Bilingual Formal / Informal Address Corpus
This page provides the parallel German-English text corpus used in Faruqui and Padó (2012). It consists of 106 public-domain novels and stories, mostly 19th-century texts. The texts are segmented into paragraphs, sentences, and words, aligned at the sentence level, POS-tagged, and lemmatized.
Corpus sources and licensing
The texts are taken from Project Gutenberg for English and Projekt Gutenberg-DE for German. The English texts can be used freely, including redistribution. The German texts are provided for free by Projekt Gutenberg-DE for personal use (which we assume to include academic fair use).
Download
List of novels, authors, and original languages
README
Training set (74 novels, 57M)
Development set (19 novels, 17M)
Test set (13 novels, 13M)
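As a quick sanity check on the split sizes listed above, the following minimal sketch confirms that the three sets add up to the 106 texts of the full corpus and shows each set's share. The variable names are illustrative only and are not part of the corpus distribution or its README.

```python
# Novel counts per split, as listed in the Download section.
# The dictionary keys are illustrative labels, not file names from the corpus.
splits = {"training": 74, "development": 19, "test": 13}

# The three splits should cover the whole corpus of 106 texts.
total = sum(splits.values())

# Share of the corpus held by each split, as a fraction of all texts.
shares = {name: count / total for name, count in splits.items()}

for name, count in splits.items():
    print(f"{name}: {count} novels ({shares[name]:.1%})")
print(f"total: {total} novels")
```

The split is therefore roughly 70% training, 18% development, and 12% test, counted by number of texts rather than by file size.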
Tools used to construct the corpus
TreeTagger: POS tagging and lemmatization for English and German
Gargantua: Unsupervised sentence alignment
Contact
Feedback is always welcome at sebastian@nlpado.de.