ANC First Release Frequency Data

Haiyang Ai

Administrator
http://americannationalcorpus.org/frequency.html

The files total 160 MB, in Windows ZIP format.


These are preliminary word frequency counts for the first release of the ANC. The counts will be refined as texts are added and our part-of-speech tagger(s) are fine-tuned. Counts are provided for the entire first release, and separately for the spoken texts and the written texts.

In addition, three versions of the bigram counts are provided:

Sorted by frequency
Sorted by first word of the bigram
Sorted by second word of the bigram
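
The three sort orders can be illustrated with a minimal Python sketch. The toy counts and the function names below are hypothetical and chosen only for illustration; the actual layout of the ANC files is not described here, so this shows the idea rather than how the files were produced.

```python
from collections import Counter

# Hypothetical toy bigram counts (not taken from the ANC data).
bigram_counts = Counter({
    ("of", "the"): 5,
    ("in", "the"): 3,
    ("the", "cat"): 2,
    ("a", "cat"): 2,
})

def sort_by_frequency(counts):
    # Highest count first; ties are broken alphabetically by the bigram.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

def sort_by_first_word(counts):
    # Alphabetical by first word, then by second word.
    return sorted(counts.items(), key=lambda kv: (kv[0][0], kv[0][1]))

def sort_by_second_word(counts):
    # Alphabetical by second word, then by first word.
    return sorted(counts.items(), key=lambda kv: (kv[0][1], kv[0][0]))

for (first, second), count in sort_by_second_word(bigram_counts):
    print(first, second, count)
```

Sorting by the second word groups together everything that precedes a given word, which is handy when looking at what typically comes before it.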
 
An n-gram is similar to what Mike Scott calls a "cluster". Here is how Mike explains it:

Suppose your text begins like this:

Once upon a time, there was a beautiful princess. She snored. But the prince didn't.

If you've chosen 2-word clusters [i.e. bigrams], the text will be split up as follows:

Once upon
upon a
a time
(note not "time there" because of the comma)
there was (etc.)

With a three-word cluster setting [i.e. trigrams], it would send

Once upon a
upon a time
there was a
was a beautiful
a beautiful princess
But the prince
the prince didn't
(etc.)

That is, each n-word cluster [i.e. n-gram] will be stored, if it reaches n words in length, up to a punctuation boundary, marked by ;,.!? (It seems reasonable to suppose that a cluster does not cross clause boundaries and these punctuation symbols help mark clause boundaries). [But note that some programs also take punctuation etc. into account.]
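
The splitting rule Mike describes can be sketched in a few lines of Python. This is only an illustration of the rule as quoted above, not WordSmith's actual code; the name `clusters` is made up here, and the sketch assumes that words are separated by whitespace and that every one of the characters ;,.!? ends a segment (so a period inside an abbreviation or a number would also split the text).

```python
import re

# The boundary characters named in the quoted explanation: ; , . ! ?
BOUNDARY = re.compile(r"[;,.!?]")

def clusters(text, n):
    """Return all n-word clusters that do not cross a punctuation boundary."""
    result = []
    for segment in BOUNDARY.split(text):
        words = segment.split()
        # A segment shorter than n words yields no clusters at all.
        for i in range(len(words) - n + 1):
            result.append(" ".join(words[i:i + n]))
    return result

text = ("Once upon a time, there was a beautiful princess. "
        "She snored. But the prince didn't.")

for cluster in clusters(text, 3):
    print(cluster)
```

With n set to 3 this prints the same seven clusters as the trigram list above. "She snored" yields nothing because the segment has only two words, and the apostrophe in "didn't" is not treated as a boundary.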
 
Thanks to Richard for the detailed explanation!

Here's kfNgram's definition:
Here n-gram is understood as a sequence of either n words,
where n can be any positive integer, also known as lexical
bundles, chains, wordgrams, and, in WordSmith, clusters, or
else of n characters, also known as chargrams. When not further
specified here, n-gram refers to wordgrams.
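
The difference between wordgrams and chargrams can be shown with a short Python sketch. The function names are hypothetical, and the sketch makes no claim about how kfNgram itself handles spaces or punctuation inside chargrams; it simply slides a window of size n over words in one case and over characters in the other.

```python
def wordgrams(text, n):
    # Slide a window of n words over the whitespace-separated tokens.
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def chargrams(text, n):
    # Slide a window of n characters over the string as given.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(wordgrams("once upon a time", 2))
print(chargrams("time", 2))
```

Here the wordgrams of "once upon a time" with n = 2 are the three word pairs, while the chargrams of "time" with n = 2 are "ti", "im" and "me".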
 