frankliang
Regular member
Corpora searchable online
BROWN & LOB CORPUS
http://www.edict.com.hk/concordance/WWWConcappE.htm
BRITISH NATIONAL CORPUS (BNC)
http://sara.natcorp.ox.ac.uk/lookup.html
The British National Corpus (BNC) is a one hundred million word corpus of British English, both spoken and written. The BNC Online service allows you to search this corpus in a variety of ways and download citations from the corpus, using a computer connected to the Internet.
If you just want a taste of what is in the BNC, you can perform a simple search using the World Wide Web. You can do this directly from the web browser you are currently using to read this page, without registering. The restricted search interface will not return more than 50 hits, with a maximum of one sentence of context for each, but it will support any legal CQL (Corpus Query Language) query.
1 MILLION WORDS BUSINESS LETTER CORPUS (US & UK) AND OTHER CORPORA
http://ysomeya.hp.infoseek.co.jp/
01 Business Letter Corpus (BLC, contains 1,020,060 word tokens of U.S. and U.K. samples, as of March 1, Y2K)
02 POS tagged BLC (A part-of-speech tagged version of the BLC).
03 Personal Letter Corpus (PLC, contains 113,522 word tokens of American samples, as of June 16, Y2K).
04 POS tagged PLC (A part-of-speech tagged version of the Personal Letter Corpus, as of March 11, 2001).
--- (Letters of Historic Figures)
05-09 Personal Letters by 19th Century Historical Figures (These four corpora contain personal and professional letters by 19th century celebrities, as of June 15, Y2K)
10 Above 05 to 09 combined (contains 910,363 word tokens).
--- (Literature and Screenplays)
11 Alice's Adventures in Wonderland (Lewis Carroll, 1865: 26,949 word tokens)
12 Through the Looking Glass and What Alice Found There (Lewis Carroll, 1872: 29,888 word tokens).
13 The Adventures of Tom Sawyer (Mark Twain, 1876: 65,942 word tokens).
14 The Adventures of Huckleberry Finn (Mark Twain, 1884: 110,865 word tokens).
15 It's a Wonderful Life (Screenplay by Frank Capra, 1946: 17,066 word tokens)
16 REBECCA (Screenplay by A. Hitchcock, 1940: 16,062 word tokens)
--- (Under construction)
17 U.S. Journalistic Articles (2,102,749 word tokens of U.S. journalistic articles)
18 Learner BLC: WM98 (209,461 word tokens in 1,464 letters written by Japanese business people. All the linguistic surface errors contained in the original data remain as they are.)
MICHIGAN CORPUS OF ACADEMIC SPOKEN ENGLISH (MICASE)
http://www.hti.umich.edu/m/micase/
Welcome to the on-line, searchable part of our collection of transcripts of academic speech events recorded at the University of Michigan.
There are currently 152 transcripts (totaling 1,848,364 words) available at this site.
Browse MICASE
Browse the corpus according to specified speaker and speech attributes, returning quick file references.
Search MICASE
Search the corpus for words or phrases in specified contexts, returning concordance results with references to files, full utterances, and speakers.
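For readers unfamiliar with "concordance results", here is a minimal sketch of a keyword-in-context (KWIC) search in Python. It is not MICASE's implementation; the function name `kwic`, the `width` parameter, and the sample sentence are all made up for illustration, and real concordancers work on tokenized, annotated text rather than a raw string.

```python
import re

def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per whole-word match of keyword.

    Hypothetical helper for illustration only: matches are case-insensitive,
    and context is measured in characters, not words.
    """
    lines = []
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    for match in pattern.finditer(text):
        # Take up to `width` characters on each side of the hit.
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Made-up sample text, not taken from any of the corpora listed here.
sample = "The corpus is large. A corpus of speech differs from a written corpus."
for line in kwic(sample, "corpus", width=20):
    print(line)
```

Each output line shows one hit with its surrounding context, which is roughly what the online concordancers in this list return, along with file and speaker references.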
WEBCORP: THE WEB AS CORPUS
http://www.webcorp.org.uk/
What is WebCorp?
However large and up-to-date the electronic text corpora available are, there will always be aspects of the language which are too rare or too new to be evidenced in them. WebCorp is a suite of tools which allows access to the World Wide Web as a corpus - a large collection of texts from which facts about the language can be extracted.
Who can use WebCorp?
WebCorp can be used by anyone who has an interest in language and how particular words and phrases are used, especially words and phrases which are too new or too rare to appear in any dictionary or standard corpus. Since its launch, WebCorp has been used by corpus linguists, lexicographers, language teachers and learners, publishers, journalists, advertisers, and researchers in a variety of fields. Although WebCorp is designed for linguistic data search, many users have found its results format (with relevant sections of text from multiple web pages collated on one page) useful for information retrieval of the type for which standard search engines are usually used.
Virtual Language Centre
http://www.edict.com.hk/concordance
Corpus | Comments | Word Count (MS Word) | Size
Articles1 | This is the default corpus, comprising articles from The Times, SCMP and the entire Brown corpus | 3,062,980 | 18,441 kb
Alice in Wonderland & Through the Looking Glass by Lewis Carroll | Two famous 19th century novels, including Jabberwocky | 56,336 | 317 kb
The Starr Report | Report published by the US Government, written by independent prosecutor Kenneth Starr | 98,856 | 591 kb
Brown | Widely used corpus of American English, compiled in 1961. 75% factual writing, 25% fiction. | 1,015,484 | 5,985 kb
LOB | The LOB (London / Oslo / Bergen Universities) Corpus is a British English counterpart of the Brown Corpus. It contains 500 text samples of about 2,000 words each. | 1,015,526 | 5,887 kb
The Times (Jan, Feb, Mar), 3 files | Articles published in The Times for Jan-March 1995. Includes business, home news, readers' letters and reviews. | (Jan) 3,567,629; (Feb) 3,351,646; (Mar) 3,301,092 | (Jan) 22,076 kb; (Feb) 20,751 kb; (Mar) 20,425 kb
SCMP | Miscellaneous texts from the South China Morning Post, compiled by Phil Benson of HKU. | 1,202,905 | 7,272 kb
Business & economy | Texts on business & economics, compiled from internet documents, 1998. | 119,972 | 738 kb
Computing | Texts on computing, compiled from internet documents, 1998. | 170,691 | 1,077 kb
Sports | Texts on sport, compiled from internet documents, 1998. | 155,539 | 919 kb
Health | Texts on health topics, compiled from internet documents, 1998. | 176,566 | 1,078 kb
Students' writing | Collection of student writing from HKPU, collected by English Dept staff. Reports, letters & instructions. | 230,418 | 1,480 kb
Language & teaching | Articles relating to teaching and language, collected from the Independent newspaper. | 96,497 | 573 kb
HK Government reports (English) | Reports published on the internet in both languages (see parallel texts) | 301,218 | 1,909 kb
Sherlock Holmes stories by Arthur Conan Doyle | A collection of stories including: The Red Headed League, The Hound of the Baskervilles, A Scandal in Bohemia, A Case of Identity, The Five Orange Pips, The Man with the Twisted Lip, The Adventure of the Speckled Band, The Adventure of the Engineer's Thumb, The Adventure of the Noble Bachelor, The Adventure of the Beryl Coronet, The Adventure of the Copper Beeches, The Blue Carbuncle, The Sign of Four | 216,386 | 1,160 kb
The Bible | The King James version, Old and New Testaments | 789,713 | 4,184 kb
The Hitchhiker's Guide to the Galaxy by Douglas Adams | The corpus includes all five books in the Hitchhiker series: The Hitch Hiker's Guide to the Galaxy; The Restaurant at the End of the Universe; Life, the Universe, and Everything; So Long, and Thanks for All the Fish; Mostly Harmless | 264,301 | 1,531 kb
The Lord of the Rings by J.R.R. Tolkien | Books 1 and 2, "The Fellowship of the Ring" | 63,236 | 366 kb
Agatha Christie stories | "The Secret Adversary" and "The Mysterious Affair at Styles" | 133,099 | 755 kb
Jack London stories | Including "Call of the Wild", "The Faith of Men", "The Fish Patrol", "The Game", "The House of Pride", "Island Tales", "On the Makaloa Mat", "Tales of the Klondyke", "Moon Face and other stories", "The Sea Wolf", "The Son of the Wolf", "South Sea Tales", "White Fang" | 590,553 | 3,328 kb
Robert Louis Stevenson | Including "New Arabian Nights", "Ballantrae", "Catriona", "The Black Arrow", "Dr Jekyll and Mr Hyde", "Fables", "Kidnapped", "Across The Plains", "In the South Seas", "Tales and Fantasies", "Treasure Island" | 777,519 | 4,204 kb
Bram Stoker | Several books: "Dracula", "The Jewel of Seven Stars", "The Lady of the Shroud", "The Man", "The Lair of the White Worm" | 540,526 | 2,869 kb
[This post was edited by the author on 2005-09-19 at 16:40:05]