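Re: Ten years on, should the BNC be retired? Who is its successor?
COCA is actually quite good already, but its sampling is still fairly coarse. Its main collection principle is still to take whatever is available: academic articles, news, fiction, and radio and TV transcripts, all of which are relatively easy to obtain in electronic form. COCA's weakness is that its design does not start from how language is actually used.
Corpora that do take actual language use into account include:
Brown family: 15 text categories; not without problems, but fairly balanced.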
BNC: very fine-grained categories, especially in the roughly 10-million-word spoken component (demographic + context-governed).
ICE: also has a very good sampling strategy.
The corpus is composed of more than 400 million words in more than 160,000 texts, including 20 million words each year from 1990-2009. For each year (and therefore overall, as well), the corpus is evenly divided between the five genres of spoken, fiction, popular magazines, newspapers, and academic journals. The texts come from a variety of sources:
* Spoken: (83 million words) Transcripts of unscripted conversation from more than 150 different TV and radio programs (examples: All Things Considered (NPR), Newshour (PBS), Good Morning America (ABC), Today Show (NBC), 60 Minutes (CBS), Hannity and Colmes (Fox), Jerry Springer, etc.). [See notes on the naturalness and authenticity of the language from these transcripts.]
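【The spoken part is essentially radio and TV broadcast material; it seems to contain little everyday conversation, whereas the BNC has a lot of it.】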
* Fiction: (79 million words) Short stories and plays from literary magazines, children’s magazines, popular magazines, first chapters of first edition books 1990-present, and movie scripts.
【Fiction takes up about a fifth of the corpus (79 of roughly 404 million words); that share is too large.】
* Popular Magazines: (84 million words) Nearly 100 different magazines, with a good mix (overall, and by year) between specific domains (news, health, home and gardening, women, financial, religion, sports, etc.). A few examples are Time, Men’s Health, Good Housekeeping, Cosmopolitan, Fortune, Christian Century, Sports Illustrated, etc.
* Newspapers: (79 million words) Ten newspapers from across the US, including: USA Today, New York Times, Atlanta Journal Constitution, San Francisco Chronicle, etc. In most cases, there is a good mix between different sections of the newspaper, such as local news, opinion, sports, financial, etc.
* Academic Journals: (79 million words) Nearly 100 different peer-reviewed journals. These were selected to cover the entire range of the Library of Congress classification system (e.g. a certain percentage from B (philosophy, psychology, religion), D (world history), K (education), T (technology), etc.), both overall and by number of words per year.
Because of copyright and licensing issues, the texts themselves are not available for download, under any circumstances. All access to the texts is via this web interface.
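To put the genre balance in numbers, here is a minimal sketch that turns the word counts quoted in the description into shares of the whole corpus. The names `GENRE_MILLIONS` and `genre_shares` are made up for illustration, and the figures are only the rounded millions given in the quote, not exact corpus counts.

```python
# Word counts in millions, copied from the quoted COCA description.
# These are rounded figures, so the resulting shares are approximate.
GENRE_MILLIONS = {
    "spoken": 83,
    "fiction": 79,
    "popular magazines": 84,
    "newspapers": 79,
    "academic journals": 79,
}


def genre_shares(counts):
    """Return each genre's share of the total as a fraction between 0 and 1."""
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


if __name__ == "__main__":
    # Print one line per genre with its percentage of the corpus.
    for genre, share in genre_shares(GENRE_MILLIONS).items():
        print(f"{genre:18s} {share:.1%}")
```

On these figures the five genres sum to 404 million words and each comes out close to one fifth of the corpus. The same function could be pointed at the category sizes of the BNC or a Brown-family corpus for a side-by-side comparison, provided those counts are taken from their own documentation.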