New 360 million word American English Corpus

xiaoz · 2008-02-22

Mark Davies' new 360 million word American English Corpus:

We are pleased to announce the initial release of the 360 million word "BYU Corpus of American English" (1990-2007), which is freely available online (http://www.americancorpus.org). New texts will be added at least two times each year from this point on (20 million new words each year; 4 million words in each of the five genres), and it will thus serve as a unique linguistic history of American English since 1990.

CONTENT

The corpus is composed of more than 360 million words in nearly 150,000 texts, including 20 million words each year from 1990-2007. For each year (and therefore overall, as well), the corpus is evenly divided between the five genres of spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources:

Spoken: (76+ million words) Transcripts of unscripted conversation from nearly 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc).

Fiction: (70 million words) Short stories and plays from literary magazines, children's magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.

Popular Magazines: (78+ million words) Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc). A few examples are Time, Men's Health, Good Housekeeping, Cosmopolitan, Christian Century, Fortune, Sports Illustrated, etc.

Newspapers: (73+ million words) Ten newspapers from across the US, including: USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. There is also a good mix between different sections of the newspapers, such as local news, opinion, sports, financial, etc.

Academic Journals: (73+ million words) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), L (education), T (technology), etc.), both overall and by number of words per year.
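As a quick sanity check on the figures quoted above, here is a minimal Python sketch that tallies them. The numbers are copied from the announcement; the variable names are made up for illustration, and the "+" sizes are treated as lower bounds, so the listed total is only approximate.

```python
# Figures quoted in the announcement, in millions of words.
WORDS_PER_YEAR = 20
GENRES = 5
FIRST_YEAR, LAST_YEAR = 1990, 2007

# Genre sizes as listed in the post. A "+" there means "at least",
# so these values are lower bounds, not exact counts.
listed_sizes = {
    "spoken": 76,
    "fiction": 70,
    "popular magazines": 78,
    "newspapers": 73,
    "academic journals": 73,
}

years = LAST_YEAR - FIRST_YEAR + 1              # inclusive year count
design_total = WORDS_PER_YEAR * years           # words implied by 20M per year
per_genre_per_year = WORDS_PER_YEAR / GENRES    # even split across genres
per_genre_target = per_genre_per_year * years   # per-genre size implied by the design
listed_total = sum(listed_sizes.values())       # sum of the listed lower bounds

print(f"{years} years x {WORDS_PER_YEAR}M = {design_total}M words by design")
print(f"per genre: {per_genre_per_year:g}M per year, {per_genre_target:g}M overall")
print(f"listed genre sizes sum to at least {listed_total}M")
```

The design figure and the listed sizes agree with "more than 360 million words"; the listed genre sizes sit a little above or below the per-genre design figure rather than matching it exactly.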

QUERIES

-- The interface is the same as the interface for the 100 million word British National Corpus and 100 million word TIME Magazine corpus (see http://corpus.byu.edu/)

-- Queries by word, phrase, alternates, substring, part of speech, lemma, synonyms (see below), and customized lists (see below)

-- The corpus is tagged by CLAWS, the same tagger that was used for the BNC and the TIME corpus

-- Chart listings (totals for all matching forms in each genre or year, 1990-present, as well as for sub-genres) and table listings (frequency for each matching form in each genre or year)

-- Full collocates searching (up to ten words left and right of node word)

-- Comparisons between genres or time periods (e.g. collocates of 'chair' in fiction or academic, nouns with 'break the [N]' in newspapers or academic, adjectives that occur primarily in sports magazines, or verbs that are more common 2004-2007 than previously)

-- One-step comparisons of collocates of related words, to study semantic or cultural differences between words (e.g. comparison of collocates of 'small' and 'little', or 'men' and 'women', or 'rob' vs 'steal')

-- Include semantic information from a 60,000 entry thesaurus directly as part of the query syntax (e.g. frequency and distribution of synonyms of 'beautiful', synonyms of 'strong' occurring in fiction but not academic, synonyms of 'clean' + noun ('clean the floor', 'washed the dishes'))

-- Create your own 'customized' word lists, and then re-use these as part of subsequent queries (e.g. lists related to a particular semantic category (clothes, foods, emotions), or a user-defined part of speech)
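For readers new to collocate searching, the idea behind the "up to ten words left and right of node word" query and the genre comparison can be sketched in a few lines of Python. This is only an illustration on toy sentences, not how the BYU interface is implemented; the function name `collocates`, the `span` parameter, and the sample texts are all assumptions, and there is no tagging or lemmatization here.

```python
from collections import Counter


def collocates(tokens, node, span=10):
    """Count the words found within `span` tokens to the left or right of `node`.

    Matching is exact and case-sensitive (a simplification).
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - span)                 # clamp the window at the text start
        hi = min(len(tokens), i + span + 1)   # and at the text end
        counts.update(tokens[lo:i])           # left context
        counts.update(tokens[i + 1:hi])       # right context
    return counts


# Toy stand-ins for two genres (invented sentences, not corpus data).
fiction = "she sat in the old chair by the fire".split()
academic = "the department chair approved the committee report".split()

fic = collocates(fiction, "chair", span=2)
aca = collocates(academic, "chair", span=2)

# Collocates that show up in one genre but not the other.
only_fiction = sorted(set(fic) - set(aca))
only_academic = sorted(set(aca) - set(fic))
print(only_fiction)
print(only_academic)
```

A real comparison would normalize by genre size and rank by an association measure rather than raw set difference; the sketch only shows the windowing step.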

NOTE

Due to copyright and licensing issues, the corpus is not available in full-text form. Rather, as with our interface to the BNC and TIME, all access will be via the web interface, which allows full frequency and distributional charts, and limited KWIC displays (up to 100 words per entry).
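In case "KWIC display" is unfamiliar: it is a keyword-in-context listing, where each hit is shown with a limited amount of text on either side. A minimal Python sketch follows, assuming a plain list of tokens; the function name `kwic`, the `context` parameter, and the sample sentence are hypothetical and have nothing to do with the actual web interface.

```python
def kwic(tokens, node, context=5):
    """Return (left, keyword, right) triples for every occurrence of `node`.

    `context` caps how many tokens are shown on each side of the keyword.
    Matching ignores case.
    """
    hits = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node.lower():
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            hits.append((left, tok, right))
    return hits


# Invented sample sentence, not corpus data.
text = "We need to break the ice before we break the news".split()
hits = kwic(text, "break", context=2)

# Print with the keyword aligned in a centre column.
for left, kw, right in hits:
    print(f"{left:>12}  [{kw}]  {right}")
```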

============================================
Mark Davies
Professor of (Corpus) Linguistics
Brigham Young University
(phone) 801-422-9168 / (fax) 801-422-0906
Web: davies-linguistics.byu.edu

** Corpus design and use // Linguistic databases **
** Historical linguistics // Language variation **
** English, Spanish, and Portuguese **
============================================

xujiajin · 2008-02-22

Re: New 360 million word American English Corpus

I guess his decision to launch the corpus at this time has something to do with the upcoming conference at BYU.

cathyz · 2008-02-25

Re: New 360 million word American English Corpus

It asks for a username and password? How do I view it?

xujiajin · 2008-02-25

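Re: New 360 million word American English Corpus

http://www.americancorpus.org/

The link xiaoz gave was copied from an email and carried the university email login information, which is why it asks for a username and password.
Use the link here and there is no problem.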

jerrycheny · 2008-03-13

Re: New 360 million word American English Corpus

Very nice, I have used it.

james_arbor · 2009-06-02

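Re: New 360 million word American English Corpus

Originally posted by xujiajin:
http://www.americancorpus.org/

The link xiaoz gave was copied from an email and carried the university email login information, which is why it asks for a username and password.
Use the link here and there is no problem.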
