Google releases their database of N-grams


Google, one of the world's biggest data collectors, is releasing
its collection of 5-grams as freely available data. Anyone
interested in doing research on techniques that use N-grams can
now wallow in an ocean of data.

Following is an excerpt from the Google announcement.

John Sowa
__________________________________________________________________

http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Google Research

All Our N-gram are Belong to You

8/03/2006 11:26:00 AM
Posted by Alex Franz and Thorsten Brants,
Google Machine Translation Team

Here at Google Research we have been using word n-gram models for a
variety of R&D projects, such as statistical machine translation, speech
recognition, spelling correction, entity detection, information
extraction, and others. While such models have usually been estimated
from training corpora containing at most a few billion words, we have
been harnessing the vast power of Google's datacenters and distributed
processing infrastructure to process larger and larger training corpora.
We found that there's no data like more data, and scaled up the size of
our data by one order of magnitude, and then another, and then one more
- resulting in a training corpus of one trillion words from public Web
pages.

We believe that the entire research community can benefit from access to
such massive amounts of data. It will advance the state of the art, it
will focus research in the promising direction of large-scale,
data-driven approaches, and it will allow all research groups, no matter
how large or small their computing resources, to play together. That's
why we decided to share this enormous dataset with everyone. We
processed 1,011,582,453,213 words of running text and are publishing the
counts for all 1,146,580,664 five-word sequences that appear at least 40
times. There are 13,653,070 unique words, after discarding words that
appear less than 200 times.
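For a sense of how such frequency cutoffs work in miniature, here is a minimal Python sketch of 5-gram counting with a minimum-count threshold. This is only an illustration of the idea on a toy token list; it bears no resemblance to Google's distributed processing pipeline, and the threshold values are just the cutoffs quoted above.

```python
from collections import Counter

def ngram_counts(tokens, n=5, min_count=40):
    """Count all n-grams in a token list, keeping only those that
    appear at least min_count times (Google kept 5-grams seen >= 40
    times, and words seen >= 200 times)."""
    counts = Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )
    return {gram: c for gram, c in counts.items() if c >= min_count}

# Toy usage with a tiny threshold, since the real corpus is a
# trillion words of running text:
tokens = "the cat sat on the mat the cat sat on the mat".split()
surviving = ngram_counts(tokens, n=5, min_count=2)
# Only the two 5-grams that occur twice survive the cutoff.
```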

Watch for an announcement at the LDC, who will be distributing it soon,
and then order your set of 6 DVDs. And let us hear from you - we're
excited to hear what you will do with the data, and we're always
interested in feedback about this dataset, or other potential datasets
that might be useful for the research community.
 
Reply: Google releases their database of N-grams

It is indeed a network problem; the site may have been blocked by mistake. Some similar sites can be reached through 无界浏览 (Wujie Liulan).
 
Reply: Google releases their database of N-grams

Quoting oscar3, 2006-8-5 23:35:58:
It is indeed a network problem; the site may have been blocked by mistake. Some similar sites can be reached through 无界浏览 (Wujie Liulan).

How do I access it with "无界浏览"?
 
Reply: Google releases their database of N-grams

As for "wu jie liu lan", you had better look it up with a search engine; it is not convenient to say much about it here. It really is a very good tool, though. Many foreign websites that cannot be reached will usually connect quickly once you use it. But please be discreet in how you mention and use it.

[This post was edited by the author on 2006-08-06 at 16:04:16]
 