The Corpus of Global Web-Based English has been launched

The Corpus of Global Web-Based English (GloWbE) is composed of 1.9 billion words from 1.8 million web pages in 20 different English-speaking countries. The corpus was created by Mark Davies of Brigham Young University, and it was released in April 2013.

GloWbE (pronounced like "globe") is related to other large corpora that we have created, including the 450 million word Corpus of Contemporary American English (COCA) and the 400 million word Corpus of Historical American English (COHA). Together, these three corpora allow researchers to examine variation in English -- by dialect, genre, and over time -- in ways that are not possible with any other large corpora of English.

SIZE: At the most basic level, GloWbE allows you to search through a corpus that is more than four times as large as COCA (and nearly twenty times as large as the British National Corpus). This means that where you might only have 10-12 tokens in the BNC and 50-60 in COCA, you might have 250-300 in GloWbE.

DIALECTS: The real power of GloWbE, though, is the ability to see the frequency of any word, phrase, or grammatical construction in each of the 20 different countries. You can also compare any features in two sets of dialects, such as British and American English (in more than 775 million words of text for just these two dialects). Or you could just limit your search to one or two countries (e.g. Australia (148 million words), South Africa (45 million), or Singapore (43 million)), and you'll still be searching the largest online corpus for most of these twenty countries.
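
Because the country sub-corpora differ so much in size, raw hit counts are not directly comparable across countries. A minimal sketch of the usual per-million-words normalization follows. The sub-corpus sizes are the ones quoted above; the function name `per_million` and the hit counts are hypothetical, invented only for illustration, and are not taken from GloWbE itself.

```python
# Sub-corpus sizes in words, as quoted in the announcement above.
SUBCORPUS_WORDS = {
    "Australia": 148_000_000,
    "South Africa": 45_000_000,
    "Singapore": 43_000_000,
}


def per_million(raw_count: int, corpus_words: int) -> float:
    """Convert a raw hit count into occurrences per million words."""
    if corpus_words <= 0:
        raise ValueError("corpus_words must be positive")
    return raw_count * 1_000_000 / corpus_words


# Hypothetical raw hit counts for some search term (illustration only,
# not real GloWbE results).
example_hits = {"Australia": 296, "South Africa": 90, "Singapore": 129}

# Normalized frequencies, which can be compared across countries.
normalized = {
    country: round(per_million(hits, SUBCORPUS_WORDS[country]), 2)
    for country, hits in example_hits.items()
}

for country, freq in normalized.items():
    print(f"{country}: {freq} per million words")
```

With these made-up counts, Australia has by far the most raw hits, yet its normalized frequency is lower than Singapore's, which is the kind of effect normalization is meant to expose.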

In terms of searches, with GloWbE you can study an extremely wide range of phenomena (the same as with all of the other corpora): words, phrases, grammatical constructions, synonyms, customized lists, and collocates (nearby words, which provide insight into meaning and usage). In addition, many of these searches are 20-30 times as fast as with other corpus architectures like CQPWeb.

To see a number of examples of what you can do with the corpus, feel free to take a quick five-minute tour.

Haiyang Ai

Staff member
Re: The Corpus of Global Web-Based English has been launched

Thanks for sharing. This is a good resource for research on the "web as corpus".


Re: The Corpus of Global Web-Based English has been launched

I wrote to Prof. Davies and pointed out a political mistake in his introduction. This is his original introduction: "This new corpus is 1.9 billion words in size, and is based on 1.8 million web pages (including blogs) from 20 different English-speaking countries (US, UK, NZ, India, Hong Kong, etc).