the null hypothesis,
(f(post) * span ) * relative_freq(the)
(2579 * 8) * (1 / 20) = 20632 / 20 = 1031.6, or roughly 1031
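As a minimal sketch of the arithmetic above (the function and parameter names are hypothetical and not taken from any Cobuild tool), the expected count under the null hypothesis could be computed like this:

```python
def expected_cooccurrence(node_freq: int, span: int, collocate_rel_freq: float) -> float:
    """Expected number of times the collocate lands within `span` positions
    of the node word, assuming the two words occur independently."""
    return node_freq * span * collocate_rel_freq


# Figures from the example: f(post) = 2579, span = 8, relative_freq(the) = 1/20
expected = expected_cooccurrence(2579, 8, 1 / 20)
print(round(expected, 1))
```

Because the span enters as a plain multiplier, widening it raises the expected count in direct proportion, which is why the choice of span affects any score built on this figure.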
Both MI and the T-score take the span as a parameter. My question
(and confusion) is: why choose 8 rather than some other number? Is
there an optimal number to use?
This is Jen Clear's reply to the inquiry:
The decision to use 4 left and 4 right (giving a span of 8) was based
on work done at Birmingham University in the 1970s by Prof. John
Sinclair (using a rather small computer corpus of only a few hundred thousand
words!) which led him to conclude that the "influence" of a lexical
item on its surrounding words dropped quite sharply beyond 4 words in
both directions, but within the 4:4 span the level of "influence" was
not significantly different whichever position was selected. Based on
the data obtained from this preliminary corpus study, the Cobuild
project used 4:4 as a standard span for almost all its collocational
Of course, MI can be calculated for any two lexical items separated by
any number of intervening words, and Ken Church demonstrated in the
mid-1980s that statistically significant (*and* interesting!)
collocations can be calculated over a span of 100 or 200 words.
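To make the role of the span concrete, here is a rough sketch that counts collocates inside a configurable left:right window and then scores a pair. All names are hypothetical. The scoring formulas, MI = log2(O / E) and t = (O - E) / sqrt(O), are the forms commonly given for these measures and are an assumption here, since the thread itself does not spell them out.

```python
import math
from collections import Counter


def collocate_counts(tokens, node, left=4, right=4):
    """Count words found up to `left` tokens before and `right` tokens
    after each occurrence of `node` (the node position itself is skipped)."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)
        window = tokens[start:i] + tokens[i + 1:i + 1 + right]
        counts.update(window)
    return counts


def mi_and_t(observed, node_freq, collocate_freq, corpus_size, span):
    """Return (MI, t-score) for one node/collocate pair.
    Assumed formulas: MI = log2(O / E), t = (O - E) / sqrt(O),
    with E = node_freq * span * relative frequency of the collocate."""
    expected = node_freq * span * (collocate_freq / corpus_size)
    mi = math.log2(observed / expected)
    t = (observed - expected) / math.sqrt(observed)
    return mi, t


# Toy corpus, for illustration only
tokens = "the post office is near the old post box by the post".split()
wide = collocate_counts(tokens, "post", left=4, right=4)
narrow = collocate_counts(tokens, "post", left=1, right=1)
print(wide["the"], narrow["the"])
```

Running the same count with a 1:1 and a 4:4 window shows how the observed frequency, and therefore both scores, depends on the span chosen.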
Oops, thanks for pointing it out. It's helpful to get all the feedback.
By the way, if anyone (not necessarily Dr Xu, who has done so much
already) is interested in translating the Readme file (essentially
my 'user guide') into Chinese, feel free to contact me.
It would help to make the Toolkit accessible to more users. Thanks.