BYU Corpus of American English will be released the week of Feb 11-15

golden dragon · 2008-02-05

CONTENT

All from American sources; e.g. TV, radio, magazines, newspapers, journals published in the US

20 million words each year, 1990 – present

360 million words, as of December 2007 (18 yrs @ 20m each)

Will be expanded at least four times each year (5m words every 3 months; 1m from each register)

Divided (overall, and for each year) into five equally-sized registers:

Spoken: Transcripts of unscripted conversation from more than 50 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc)

Fiction: Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, movie and TV scripts

Popular Magazines: Nearly 100 different magazines, with a good mix (overall, and by year) across specific domains (news, health, home and gardening, women, financial, religion, sports, etc). Examples include Time, Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, and Sports Illustrated.

Newspapers: Ten newspapers from across the US, including: USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. In most cases, a good mix between different sections of the newspaper, such as local news, opinion, sports, financial, etc.

Academic Journals: Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), L (education), T (technology), etc.), both overall and by number of words per year
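The size figures above follow from simple arithmetic, sketched below in Python. The constants come from this post; the function and variable names are only illustrative, and the sketch assumes every year from 1990 through 2007 is counted in full.

```python
# Figures taken from the announcement; names are illustrative, not part of any corpus API.
WORDS_PER_YEAR = 20_000_000
REGISTERS = ["spoken", "fiction", "magazines", "newspapers", "academic"]


def corpus_size(first_year, last_year, words_per_year=WORDS_PER_YEAR):
    """Total words for an inclusive range of years."""
    years = last_year - first_year + 1  # 1990..2007 inclusive is 18 years
    return years * words_per_year


def words_per_register_per_year(words_per_year=WORDS_PER_YEAR):
    """Words per register per year, assuming equally sized registers."""
    return words_per_year // len(REGISTERS)


def quarterly_update(words_per_year=WORDS_PER_YEAR, updates_per_year=4):
    """Words added per update when a year's text arrives in equal batches."""
    return words_per_year // updates_per_year


print(corpus_size(1990, 2007))            # 360000000
print(words_per_register_per_year())      # 4000000
print(quarterly_update())                 # 5000000
print(quarterly_update() // len(REGISTERS))  # 1000000 per register per update
```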

FEATURES

Interface similar to our interfaces for the British National Corpus and the TIME Magazine corpus

Queries by word, phrase, alternates, substring, part of speech, lemma, and customized lists (e.g. user-created lists related to a particular semantic category, or a user-defined part of speech)

Corpus will be tagged by CLAWS, the same tagger that was used for the BNC

Chart listings (totals for all matching forms in each register or year, 1990-present) and table listings (frequency for each matching form in each year or register)

Full collocates searching (up to ten words left and right of node word)

Comparisons between registers or time periods (e.g. collocates of a given word that are more common in one register than another, or which appear only after 2003)

Comparison of collocates of related words (e.g. one-step comparison of collocates of big and large, or with men and women, or comparison of collocates of chair in different registers)

Incorporation of semantic information from WordNet directly into query (e.g. search for all synonyms of sweet + type of food, or find the frequency of all verbs related to walk)

Due to copyright and licensing issues, the corpus will not be available in full-text form (as it is with the CD-ROM version of the BNC). Rather, as with our interface to the BNC, all access will be via the web interface, which allows limited KWIC displays (up to 100 words per entry)
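For readers new to the terms, here is a toy Python sketch of what collocate counting, collocate comparison, and a KWIC (keyword in context) display do. It is not how the web interface is implemented; the function names, the sample sentence, and the small window sizes are assumptions made for illustration (the interface itself allows up to ten words left and right of the node word).

```python
from collections import Counter


def collocates(tokens, node, left=4, right=4):
    """Count words found within `left`/`right` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


def kwic(tokens, node, width=5):
    """Return one keyword-in-context line per occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left_ctx = " ".join(tokens[max(0, i - width):i])
            right_ctx = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left_ctx} [{tok}] {right_ctx}".strip())
    return lines


# Made-up sample text, tokenized naively on whitespace.
tokens = "the big dog chased the big red ball past the large dog".split()

big = collocates(tokens, "big", left=1, right=1)
large = collocates(tokens, "large", left=1, right=1)

# Collocates seen with "big" but never with "large" in this sample.
only_big = set(big) - set(large)

print(big)
print(only_big)
print(kwic(tokens, "large", width=2))
```

A real comparison between registers or years would run the same counts on separate slices of the corpus and compare relative frequencies rather than raw set differences.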

http://www.americancorpus.org/
