The Leipzig Corpora Collection


The Leipzig Corpora Collection presents corpora in different languages using
the same format and comparable sources. The following languages are included:
Catalan, Danish, Dutch, English, Estonian, Finnish, French, German, Italian,
Japanese, Korean, Norwegian, Sorbian, Swedish, and Turkish.

There is an online interface at . Moreover, all
data are available as plain text and as MySQL database tables for various
applications. The corpora are ready to use with the Corpus Browser, see . The
corpora are intended both for scientific use by corpus linguists and for
applications such as knowledge extraction programs.

The corpora are identical in format and similar in size and content. They
contain randomly selected sentences in the language of the corpus and are
available in sizes of 100,000 sentences, 300,000 sentences, 1 million sentences,
and so on. The sources are either newspaper texts or texts randomly collected
from the web. The texts are split into sentences. Non-sentences and
foreign-language material were removed.

As the order of sentences is scrambled, these data are not helpful in tasks
that go beyond sentence boundaries. But this design helps us to overcome
copyright issues, as documents are not reconstructible from the corpora
provided and single sentences are not protected by copyright.
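
The scrambling step described above can be illustrated with a minimal sketch.
It assumes the documents are already split into sentences. The function name
build_corpus, its parameters, and the fixed seed are hypothetical and are not
taken from the actual Leipzig tooling.

```python
import random


def build_corpus(documents, size, seed=0):
    """Pool sentences from all documents, shuffle them, and keep `size` of them.

    `documents` is a list of documents, each a list of sentence strings.
    Shuffling discards document boundaries and sentence order, so the
    original texts cannot be reconstructed from the result.
    """
    sentences = [sentence for doc in documents for sentence in doc]
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    rng.shuffle(sentences)
    return sentences[:size]


docs = [
    ["First sentence of A.", "Second sentence of A."],
    ["First sentence of B.", "Second sentence of B.", "Third sentence of B."],
]
corpus = build_corpus(docs, size=3)
for sentence in corpus:
    print(sentence)
```

Only the flat, shuffled list is kept, so neither the source document nor the
position of a sentence within it survives.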

Because information about which words co-occur is useful for many
applications, these data were precomputed and included as well. For each word,
the most significant words appearing

a) as immediate left neighbour

b) as immediate right neighbour

c) anywhere within the same sentence

are given. The quality of such co-occurrence data improves with corpus size, so
the forthcoming larger corpora should yield better results.
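
The three kinds of co-occurrence listed above can be sketched as follows. This
is an illustration only: whitespace tokenisation is an assumption, and raw
counts stand in for the significance measure, which this text does not
specify. The function name cooccurrences is hypothetical.

```python
from collections import Counter, defaultdict


def cooccurrences(sentences):
    """Count left-neighbour, right-neighbour, and same-sentence co-occurrences."""
    left = defaultdict(Counter)      # left[w][v]: v occurs immediately before w
    right = defaultdict(Counter)     # right[w][v]: v occurs immediately after w
    sentence = defaultdict(Counter)  # sentence[w][v]: v occurs in a sentence with w
    for text in sentences:
        tokens = text.split()  # assumption: whitespace tokenisation
        for i, word in enumerate(tokens):
            if i > 0:
                left[word][tokens[i - 1]] += 1
            if i + 1 < len(tokens):
                right[word][tokens[i + 1]] += 1
        # Count each pair of distinct word types once per sentence.
        types = set(tokens)
        for word in types:
            for other in types:
                if other != word:
                    sentence[word][other] += 1
    return left, right, sentence


sample = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
left, right, sentence = cooccurrences(sample)
print(left["sat"].most_common())
print(right["sat"].most_common())
print(sentence["cat"].most_common())
```

Ranking by raw count favours frequent function words. A statistical
significance score would replace the counts in a realistic setting.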

The authors will add larger corpora and new languages soon. The Leipzig Corpora
Collection is also open to include other existing corpora in collaboration with
the corresponding owners.

Please contact: