You can now download full-text data for the two largest BYU corpora:

Corpus of Contemporary American English (COCA). 440 million words of downloadable text; the largest, most up-to-date, publicly available corpus of English that is balanced for genre (spoken, fiction, magazine, newspaper, and academic).
Corpus of Global Web-Based English (GloWbE). 1.8 billion words of downloadable text, divided into groups from twenty different English-speaking countries (US, UK, Canada, Australia, India, etc.). About 60% of the text comes from blogs, which makes it a good source of very informal language.

With this full-text data, you will have the actual corpora on your computer, and you can search the data in any way that you'd like. You can generate your own frequency data, collocates, n-grams, or concordance lines; you can search by word, lemma, and part of speech; and you can carry out complex syntactic and semantic searches offline. You can even modify the lexicon and sources tables to search the corpora in ways that are not possible via the standard web interfaces.
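
As a rough illustration of the kind of offline processing described above, here is a minimal Python sketch that builds word frequencies and n-gram counts from linear (horizontal) text. The tokenizer, the function names, and the sample sentence are all assumptions made for illustration; they are not part of the distributed data or of any official tooling.

```python
import re
from collections import Counter


def tokenize(text):
    # Simplistic tokenizer (an assumption): lowercase alphabetic words,
    # optionally with an internal apostrophe, e.g. "don't".
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def ngram_counts(tokens, n):
    # Slide a window of length n over the tokens and count each tuple.
    return Counter(zip(*(tokens[i:] for i in range(n))))


# Hypothetical sample text standing in for a line of corpus data.
sample = "The cat sat on the mat and the cat slept"
tokens = tokenize(sample)

word_freq = Counter(tokens)            # single-word frequencies
bigram_freq = ngram_counts(tokens, 2)  # two-word sequences

print(word_freq.most_common(3))
print(bigram_freq.most_common(3))
```

For real files you would read the text from disk in chunks rather than from a string, but the counting logic stays the same.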

The data comes in three different formats (see samples): data for relational databases (info), word/lemma/PoS (vertical), and linear text (horizontal). When you purchase the data, you purchase the rights to any and all of these formats.
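
To give a sense of how the word/lemma/PoS (vertical) format might be used, the following Python sketch counts lemmas filtered by part-of-speech tag. It assumes one token per line with three tab-separated columns (word, lemma, PoS); the actual column order and delimiter should be checked against the samples. The function name `lemma_freq` and the tags in the sample are hypothetical.

```python
import io
from collections import Counter


def lemma_freq(lines, pos_prefix=""):
    # Count lemmas whose PoS tag starts with pos_prefix.
    # Assumed layout (not confirmed by the source): word<TAB>lemma<TAB>pos
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        word, lemma, pos = fields[:3]
        if pos.startswith(pos_prefix):
            counts[lemma.lower()] += 1
    return counts


# Hypothetical sample; the tags shown are illustrative only.
sample = "dogs\tdog\tnn2\nbark\tbark\tvv0\nDog\tdog\tnn1\n"
noun_freq = lemma_freq(io.StringIO(sample), pos_prefix="nn")
print(noun_freq)
```

With a real file you could pass an open file handle instead of the `io.StringIO` object, since the function only iterates over lines.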

I hope that these resources will be useful to you in your research and teaching.


Mark Davies
Re: COCA+GloWbE

Amazing news, and the price is quite fair. I'm just curious how Mark deals with the copyright issue (it seems the purchased version can be used for commercial purposes). Another concern is the sample, which contains a lot of strange symbols (@@@@@@@@@@, perhaps a page indicator?).