http://halfedge.blogspot.com/2005/09/web-as-corpus.html
An introduction to the journal’s special issue that gives a good overview of the subject.
With regard to language, the paper gives some interesting statistics on language distribution. Most of the techniques discussed use search engines as the interface to the web rather than actual crawls.
The paper points out that even large corpora constructed from the web might be too small to be statistically meaningful for rare terms, because term frequencies follow a Zipf distribution. This observation has to be kept in mind. For example, when trying to infer the geographic meaning of rare terms (or term combinations), we might run into a similar problem.
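To make the Zipf argument concrete, here is a minimal sketch of how few occurrences a rare term is expected to have even in a large corpus. It assumes an idealized Zipf law with exponent 1; the function names, corpus size, and vocabulary size are illustrative assumptions, not figures from the paper.

```python
def harmonic(n: int, s: float = 1.0) -> float:
    """Generalized harmonic number H(n, s), the Zipf normalization constant."""
    return sum(1.0 / k**s for k in range(1, n + 1))


def expected_count(rank: int, corpus_tokens: int, vocab_size: int, s: float = 1.0) -> float:
    """Expected occurrences of the term at a given frequency rank,
    assuming an idealized Zipf distribution over vocab_size terms."""
    probability = (1.0 / rank**s) / harmonic(vocab_size, s)
    return corpus_tokens * probability


if __name__ == "__main__":
    corpus_tokens = 1_000_000_000  # assumed corpus size (illustrative)
    vocab_size = 100_000           # assumed vocabulary size (illustrative)
    for rank in (1, 1_000, 100_000):
        count = expected_count(rank, corpus_tokens, vocab_size)
        print(f"rank {rank:>7}: about {count:,.0f} expected occurrences")
```

Under these assumptions the expected count falls off as 1/rank, so a term at rank 100,000 is expected 100,000 times less often than the most frequent term. Counts for combinations of rare terms would be smaller still, which is the concern raised above.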
This special issue of the journal contains another interesting article, Philip Resnik and Noah A. Smith, “The Web as a Parallel Corpus”, which I have yet to read.
Adam Kilgarriff; Gregory Grefenstette
Introduction to the Special Issue on the Web as Corpus
Computational Linguistics, Volume 29, Number 3, September 2003: Special Issue on the Web as Corpus
Published by The MIT Press for: The Association for Computational Linguistics.
				