[Repost] A good article on T-Score and Mutual Information

#1
Hunston (2002: 71-73): t-score of 2 or higher; MI score of 3 or higher

Information provided by Wordbank on October 16, 2003

If someone asked you "What is the value to lexicographers of (a) MI listings and (b) t-score listings?", what (in a nutshell) would you answer?

a) MI: in statistically pure terms this is a measure of the strength of association between words x and y. In a given finite corpus MI is calculated on the basis of the number of times you observed the pair together versus the number of times you saw the pair separately. Now consider two (made-up) examples: first the pair "kith and kin", when "kith" never appears without "kin". In a large corpus one might observe the pair, say, 3 times. The frequency of "kith" itself will almost certainly be 3 also. So, statistically, the probability that you will see "kin" if you've seen "kith" is absolute certainty. While it may not be true vice versa (there may be some instances of "kin" without "kith") the frequency of "kin" overall in the corpus may not be much higher than 3 (perhaps 6 or 7). The MI will therefore be very *high* -- these two words will be very strongly associated.

Now consider a second (made-up) example, "falling prices". Each of these two words is likely to have a medium to high frequency in a large general corpus; let's say f("falling") = 1000 and f("prices") = 1000. One might see the pair together, say, 40 times -- i.e. f("falling prices") = 40. Now the MI figure will not be particularly high, because there will be plenty of evidence of "falling" occurring without "prices" and vice versa. We are nowhere near 100% certainty that one will be accompanied by the other (as we were in the case of "kith and kin") so statistically the strength of association between "falling" and "prices" is much less than it was for "kith" and "kin".

b) t-score: this is a measure not of the strength of association but THE CONFIDENCE WITH WHICH WE CAN ASSERT THAT THERE IS AN ASSOCIATION. There is a subtle difference (maybe it's not subtle -- but I have had great difficulties explaining it to lexicographers/linguists!!) between these two. Consider an analogy from other applications of statistics: suppose I walk onto the street and stop 5 people and ask them a) are they right-handed and b) do they live in this town. Suppose 4 people said they were residents of this town and all 5 were right-handed. Now I computerise my results and produce an MI score for the attributes "living in this town" and "being right-handed". The result will be a massively high MI score for the pairing, because I got 100% certainty that people who live in this town are right-handed and 80% (4/5) for the converse. OK, now let's interpret our MI result. Is this an interesting observation? No -- because the sample size was too small and because the number of observations of the pair (right-handed + resident of this town) was also rather low (4). MI tells us that the association is very strong, but the t-score takes account of the ACTUAL number of pairings we observed overall. If we have only met 4 people who are right-handed and live in this town then we cannot have a high degree of confidence that there is a correlation between these two observations. MI says the association is very strong; the t-score says "maybe, but we haven't seen enough evidence to be sure that the MI is right!"

The t-score will take account of the actual number of observations of the pairing to assess whether we can be confident in claiming an association. So to revert to the "kith/kin" "falling/prices" examples, the t-score for "kith/kin" is likely to be lowish, or lower anyway than for "falling" + "prices", because having seen 40 instances of the latter we can be more confident that there is some association (albeit a much weaker one) than we can in the former case. The t-score takes account of the VARIANCE of the co-occurrence figures.
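To make the contrast concrete, here is a minimal Python sketch of the two made-up examples. It is only an illustration under stated assumptions: the text does not spell out the formulas, so the sketch uses the definitions common in corpus linguistics, MI = log2(observed / expected) and t = (observed - expected) / sqrt(observed), with expected = f(x) * f(y) / N for adjacent pairs. The corpus size N = 20,000,000 and f("kin") = 6 are illustrative choices, not figures from a real corpus.

```python
import math

N = 20_000_000  # assumed corpus size in tokens (hypothetical)

def expected_pair(f_x, f_y, n=N):
    """Expected number of adjacent co-occurrences if x and y were independent."""
    return f_x * f_y / n

def mi_score(observed, expected):
    """Mutual information as log2 of the observed/expected ratio (assumed formula)."""
    return math.log2(observed / expected)

def t_score(observed, expected):
    """t-score as (observed - expected) / sqrt(observed) (assumed formula)."""
    return (observed - expected) / math.sqrt(observed)

# name: (f(x), f(y), joint frequency), using the made-up counts from the text
pairs = {
    "kith + kin": (3, 6, 3),
    "falling + prices": (1000, 1000, 40),
}

scores = {}
for name, (f_x, f_y, joint) in pairs.items():
    e = expected_pair(f_x, f_y)
    scores[name] = (mi_score(joint, e), t_score(joint, e))
    print(f"{name:18s} MI = {scores[name][0]:6.2f}   t = {scores[name][1]:5.2f}")
```

Under these assumptions "kith + kin" gets the higher MI and "falling + prices" the higher t-score, which is the pattern described above.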

OK, now let me summarise with some observations about the linguistic implications of using the two scores. Here, first, is a crude rule of thumb: MI scores are often high in the case of "weirdos" while t-scores are often high for functional/grammatical pairings. Here is another point to bear in mind: both MI and t-score are non-directional. The points I made in an earlier mail about "upward" and "downward" collocation (when a word collocates either with one much more frequent or one much less frequent) are not directly relatable to MI or t-score. So it doesn't matter whether you choose "whittle" or "down" as your keyword when calculating MI and t: the results will be the same either way.

In a real life corpus, MI will throw up into the limelight any pair for which it is the case that the frequency of cooccurrence is a high proportion of the overall frequency of either of the pair. Proper names, technical terms, idiosyncratic phrases and the like are well highlighted by MI. "Yasser Arafat" will score very highly with MI because the chances of seeing either word without the other are very small. Similarly, locutions particular to some text, like, say, "beet residues", may come up with high MI scores. Why? Because neither "beet" nor "residues" will be very frequent overall in the corpus and it only needs one text on agriculture to use the phrase 4 times and bingo you have evidence for a strong association. You can try to correct for this by ignoring pairings for which the frequency of cooccurrence is low (Ken suggested cutting off at 5, so that our "beet residues" example would be ignored if it occurred only 4 times).
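The frequency cut-off mentioned above is simple to apply before any scoring. A minimal sketch, assuming candidates arrive as (pair, joint frequency) tuples; the function name `apply_cutoff` and the sample list are hypothetical, and the counts are the made-up ones used in this explanation.

```python
MIN_JOINT = 5  # cut-off on joint frequency, as suggested in the text

def apply_cutoff(candidates, min_joint=MIN_JOINT):
    """Keep only (pair, joint_freq) entries seen at least min_joint times."""
    return [(pair, joint) for pair, joint in candidates if joint >= min_joint]

# Made-up joint frequencies taken from the examples in this explanation
candidates = [
    ("beet residues", 4),
    ("kith and kin", 3),
    ("bad taste", 70),
    ("falling prices", 40),
]

kept = apply_cutoff(candidates)
for pair, joint in kept:
    print(pair, joint)
```

Note that a cut-off of 5 discards "kith and kin" along with "beet residues", so the threshold trades away some genuine but rare fixed phrases.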

The t-score promotes pairings which have been well attested (i.e. there have been a goodly number of cooccurrences). This is good for more grammatically conditioned pairs, like "depend on", or for stereotyped combinations which are not confined to particular subject fields or texts. Other t-score favourites would be "take stock" or "bad taste". In these cases the strength of association need not be large (no-one would say "oh 'taste' is nearly always associated with 'bad'" or "if you've seen the word 'stock' you know you're going to find 'take' immediately before it"! -- these statements can't be valid since these words occur plenty often in other ways). But the confidence that there is *some* association is pretty high, because in a large corpus you may see "bad taste" 70 times. The t-score has a tendency to promote what you might consider to be uninteresting pairings on the basis of their high frequency of cooccurrence.

Let's look at an example: TASTE. Below are the first 16 collocates (span of 4:4) taken from the MI and t-score listings.

TASTE (collocates by MI)
collocate    corpus freq   expected   observed     MI
arbiters              20      0.008          5   9.26
ve                    29      0.012          3   7.99
aroma                 97      0.040          9   7.83
buds                 109      0.044         10   7.81
decency              127      0.052         11   7.73
arbiter               59      0.024          5   7.70
salty                 64      0.026          5   7.58
's                  1411      0.575         99   7.43
sans                  45      0.018          3   7.35
seasoning            165      0.067         10   7.22
lapses                70      0.029          4   7.13
savour                72      0.029          4   7.09
texture              382      0.156         19   6.93
pepper               465      0.190         20   6.72
salt                 900      0.367         37   6.66
sour                 248      0.101          9   6.48

TASTE (collocates by t)
collocate    corpus freq   expected   observed      t
's                  1411      0.571         99   9.92
for               252486    102.121        256   9.64
and               720454    291.395        451   7.53
good               22178      8.970         71   7.39
bad                 4484      1.814         56   7.28
salt                 900      0.364         37   6.06
season              4216      1.705         31   5.31
first              37597     15.207         51   5.06
with              188775     76.352        134   5.01
pepper               465      0.188         20   4.49
popular             3628      1.468         22   4.43
acquired            1185      0.479         20   4.42
texture              382      0.155         19   4.38
my                 37077     14.996         42   4.22
bitter               861      0.348         17   4.10
smell                895      0.362         17   4.10

What's the difference? The MI list has "arbiter(s)" figuring very prominently, also "salty", and "sans". This is why I am not a great fan of MI for lexicogs -- do you want to put "sans taste (... sans everything)" in your dictionary? What about the cliche "arbiter of taste"? Why do these have big MI scores? Because these collocates are not very frequent overall ("arbiters" occurs only 20 times in the whole corpus; "sans" only 45 times) so to occur 3 times with "taste" is a big deal in terms of strength of association. "sans" and "ve" are weirdos, and they get into the list because of some peculiarity of this corpus (someone spouting Shakespeare in one case perhaps, and a spurious enclitic form of "have" -- most probably caused by use of double quote instead of apostrophe in 'we"ve' or 'they"ve'). MI loves the weirdos. "buds", of course, is your best collocation here -- good sound compound, must go in the dictionary, no problem.

The t list on the other hand has "good", "bad", "popular" and "acquired" showing up. These seem to me to be more interesting lexicographically: first these collocates are pointers to the semantic area of "taste" meaning 'socially acceptable aesthetic/moral/ethical attitudes' and second these collocates are more *typical* of the collocational behaviour of "taste" than "salty", "sans" or "arbiters". Note also that "for" gets a high t score (because of the locution "have/develop/acquire/get/etc a taste for sth"). This idiomatic usage must surely be dealt with in a learners' dictionary. Indeed, Cobuild includes sense 5 "a taste for sth" and OALD likewise. Cobuild and ALD draw special attention to "in good/bad taste". Both deal with "acquired taste" under ACQUIRE. These collocates will not score high MI values because they are themselves frequent words in English and it cannot be said that "taste" strongly predicts "good" or vice versa. "for" does appear in the MI list -- it is at rank 205 with an MI score (MI=1.31) which is below the "significant" level (MI=1.5) and you wouldn't have picked it up down there amongst the dross. [Remember: "of" doesn't occur in here at all because I have it on a stop list in my indexing software.]
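The two rankings can be roughly reproduced from the expected and observed joint frequencies in the TASTE listings. The sketch below is an illustration, not the software that produced the listings: it assumes MI = log2(observed / expected) and t = (observed - expected) / sqrt(observed), and it uses a handful of rows copied from the listings. Because the published expected values are rounded, the computed scores come out close to, but not identical with, the published ones.

```python
import math

# collocate: (expected joint freq, observed joint freq), copied from the TASTE listings
rows = {
    "arbiters": (0.008, 5),
    "buds": (0.044, 10),
    "'s": (0.575, 99),
    "salt": (0.367, 37),
    "good": (8.970, 71),
    "for": (102.121, 256),
}

def mi_score(expected, observed):
    """log2 of observed/expected (assumed MI formula)."""
    return math.log2(observed / expected)

def t_score(expected, observed):
    """(observed - expected) / sqrt(observed) (assumed t-score formula)."""
    return (observed - expected) / math.sqrt(observed)

mi = {w: mi_score(e, o) for w, (e, o) in rows.items()}
t = {w: t_score(e, o) for w, (e, o) in rows.items()}

by_mi = sorted(rows, key=mi.get, reverse=True)
by_t = sorted(rows, key=t.get, reverse=True)

print("by MI:", by_mi)
print("by t: ", by_t)
```

On these six rows the MI ordering puts "arbiters" first and "for" last, while the t ordering puts "'s" and "for" first and "arbiters" last.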

"buds" didn't get into the very top of the t-scores because we only saw 10 of them out of a corpus of 30m words. But it ranked 27th in the t-score list, with a t value (t=3.23) well above significance (t=1.65). So the t-score seems to get more typical collocates *without* missing the important MI collocates.

-----------------------------------------------------
[This was an explanation of t-score that I sent out to a CobuildDirect user who enquired about it.]

Look at the collocations for the node "post" in the 20m CobuildDirect corpus. It co-occurs with many words, among which are "the", "office" and "mortem".

The observable facts are that "post" has an overall corpus freq of 2579 (let's refer to this as f(post)=2579) and also

f(office) = 5237
f(the) = 1019262
f(mortem) = 51

We also observe the number of times these words co-occurred with "post" (for shorthand I'll write j(the) = 1583 to mean that "the" occurred with "post" 1583 times: this is the "joint" frequency). So

j(the) = 1583
j(office) = 297
j(mortem) = 51

Now if we were to list the collocates of "post" by raw frequency of co-occurrence we would order them according to j(x), as above. Of course, a full collocation listing of "post" in this form would have many other words with intermediate frequencies -- we are just focussing on these three words for the moment. But the ordering shown above doesn't tell us anything much about the strength of association between "post" and these other words: it is simply a reflection of the basic overall frequency of the collocating words (i.e. "the" is much more frequent than "office" which is much more frequent than "mortem"). We just showed that in the f(x) list! This is true in general: ordering collocates by j(x) simply places words like "the", "a", "of", "to" at the top of every collocate list. What we would like to know is

------------------------------------------------------------------------
IMPORTANT QUESTION: to what extent does the word "post" condition its lexical environment by selecting particular words with which it will co-occur?
------------------------------------------------------------------------

We can compare the relative frequencies of what we observed with what we would expect under the null hypothesis:


------------------------------------------------------------------------
NULL HYPOTHESIS: the word "post" has no effect whatsoever on its lexical environment and the frequencies of words surrounding "post" will be exactly (give or take random fluctuation) the same as they would be if "post" were present or not.
------------------------------------------------------------------------

That is, if "the" has an overall relative frequency of 1 in 20 (about 1m occurrences in a 20m word corpus -- see f(the) above) then we can expect "the" to occur with the same relative frequency in a subset of the corpus which is the 4 words either side of "post": hence under the null hypothesis we would expect j(the) to be

(f(post) * span ) * relative_freq(the)

which is

(2579 * 8) * (1 / 20) = 20632 / 20 = 1031.6, or roughly 1031

So under the null hypothesis we would expect j(the) to be 1031. We actually observed j(the) to be 1583, which is rather higher, and we could simply express the difference as a ratio (of observed to expected joint frequency) thus:

1583/1031

This ratio is the basis of the Mutual Information score (MI is conventionally reported as the base-2 logarithm of the ratio) and it expresses the extent to which observed frequency of co-occurrence differs from expected (where we mean "expected under the null hypothesis"). Of course, big differences indicate massive divergence from the null hypothesis and indicate that "post" is exerting a strong influence over its lexical environment.
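The expected-frequency calculation can be sketched in a few lines of Python using the "post" figures quoted above. Two things here are assumptions rather than statements from the text: the exact corpus size of 20,000,000 tokens (the text only says "20m"), and the reporting of MI as log2 of the observed/expected ratio. With the exact f(the) rather than the rounded "1 in 20", the expected j(the) comes out at about 1051 instead of 1031.

```python
import math

CORPUS_SIZE = 20_000_000  # assumed exact size of the "20m" corpus
SPAN = 8                  # 4 words either side of the node
F_POST = 2579             # f(post)

corpus_freq = {"the": 1019262, "office": 5237, "mortem": 51}  # f(x)
joint_freq = {"the": 1583, "office": 297, "mortem": 51}       # j(x)

def expected_joint(f_collocate):
    """Expected j(x) under the null hypothesis: window tokens times relative frequency."""
    return F_POST * SPAN * f_collocate / CORPUS_SIZE

expected = {w: expected_joint(f) for w, f in corpus_freq.items()}
ratio = {w: joint_freq[w] / expected[w] for w in corpus_freq}
mi = {w: math.log2(ratio[w]) for w in corpus_freq}  # assumed log2 convention

for w in corpus_freq:
    print(f"{w:8s} expected = {expected[w]:10.4f}  ratio = {ratio[w]:8.2f}  MI = {mi[w]:5.2f}")
```

On these figures "mortem" has by far the largest ratio and "the" the smallest, even though "the" has the largest raw joint frequency.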

BUT BUT BUT! there is a Big Problem with Mutual Information: suppose the word "egregious" appears just once with "post" (not an unreasonable event) in the corpus. And "egregious" may have a very low overall freq:

f(egregious) = 3

Now we carry out the sums to calculate the expected j(egregious) figure. I can assure you it will be a small number! It is:

( f(post) * span ) * relative_freq(egregious)

(2579 * 8) * ( 3 / 20000000)

= 0.0030948

Now you'll see that even if "egregious" occurs just once in the vicinity of "post" the observed j(egregious) will be 323 times more than the expected joint frequency, and the mutual information value will be high. Common sense tells us that, since words cannot appear 0.0030948 times -- they either occur zero or one times, nothing in between -- claiming that "post"+"egregious" is a significant collocation is rather dubious.

In general, the comparison of observed j(x) and expected j(x) will be very unreliable when values of j(x) are low; this is common sense, too. Just because I've seen these two words together once in 20m words doesn't give me much confidence that they are strongly associated: I'd need to see them together several times at least before I could start to feel at all secure in claiming that they have some sort of significant association.

Now here comes T-score. We can calculate a second-order statistic which is, crudely, this:

------------------------------------------------------------------------
IMPORTANT QUESTION: how confident can I be that the association that I've measured between "post" and "egregious" is true and not due to the vagaries of chance?
------------------------------------------------------------------------


T-score answers this question. It takes account of the size of j(x) and weights its value accordingly. A high T-score says: it is safe (very safe/pretty safe/extremely secure etc according to value) to claim that there is some non-random association between these two words.

So t-scores are higher when the figure j(x) is higher. In the case of "egregious" we would get a very low t-score. In the case of "the" the t-score might be quite high, but not huge because "the" doesn't have that strong an association with "post". "office" gets a really high t-score because not only is the observed j(office) way higher than expected, but we have seen a goodly number of such co-occurrences, enough to be pretty damn sure that this can't be due to some freak of chance.
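The behaviour described above can be checked with a short sketch. It assumes the commonly used approximation t = (observed - expected) / sqrt(observed), which the text does not state explicitly, and an exact corpus size of 20,000,000 tokens. The counts for "the", "office", "mortem" and "egregious" are the ones quoted in this explanation.

```python
import math

CORPUS_SIZE = 20_000_000  # assumed exact size of the "20m" corpus
SPAN = 8                  # 4 words either side of the node
F_POST = 2579             # f(post)

# word: (corpus frequency f(x), joint frequency j(x))
counts = {
    "the": (1019262, 1583),
    "office": (5237, 297),
    "mortem": (51, 51),
    "egregious": (3, 1),
}

def expected_joint(f_collocate):
    """Expected j(x) under the null hypothesis."""
    return F_POST * SPAN * f_collocate / CORPUS_SIZE

def t_score(observed, expected):
    """(observed - expected) / sqrt(observed): an assumed, widely used approximation."""
    return (observed - expected) / math.sqrt(observed)

t_scores = {w: t_score(j, expected_joint(f)) for w, (f, j) in counts.items()}
ranking = sorted(t_scores, key=t_scores.get, reverse=True)

for w in ranking:
    print(f"{w:10s} t = {t_scores[w]:6.2f}")
```

Under these assumptions "office" ranks first and "egregious" last with a t-score below 1, despite its observed/expected ratio of about 323.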

In practical terms, raw frequency or j(x) won't tell you much at all about collocation: you'll simply discover what you already knew, that "the" is a *very* frequent word and seems to co-occur with just about everything. MI is the proper measure of strength of association: if the MI score is high, then observed j(x) is massively greater than expected, BUT you've got to watch out for the low j(x) frequencies because these are very likely to be freaks of chance, not consistent trends. t-score is best of the lot, because it highlights those collocations where j(x) is high enough not to be unreliable and where the strength of association is distinctly measurable.

Try the different measures: you'll soon see the difference. Raw freq often picks out the obvious collocates ("post office" "side effect") but you have no way of distinguishing these objectively from frequent non-collocations (like "the effect" "an effect" "effect is" "effect it" etc). MI will highlight the technical terms, oddities, weirdos, totally fixed phrases, etc ("post mortem" "Laurens van der Post" "post-menopausal" "prepaid post"/"post prepaid" "post-grad"). T-score will get you significant collocates which have occurred frequently ("post office" "Washington Post" "post-war", "by post" "the post").

If a collocate appears in the top of both MI and t-score lists it is clearly a humdinger of a collocate, rock-solid, typical, frequent, strongly associated with its node word, recurrent, reliable, etc etc etc.
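Finding collocates that sit near the top of both lists is a simple intersection. As a hedged illustration, the sketch below uses the two top-16 TASTE listings quoted earlier in this post; the variable names are hypothetical and the choice of 16 as the cut-off is only what those listings happen to show.

```python
# Top collocates of "taste" as they appear in the two listings quoted earlier
top_by_mi = [
    "arbiters", "ve", "aroma", "buds", "decency", "arbiter", "salty", "'s",
    "sans", "seasoning", "lapses", "savour", "texture", "pepper", "salt", "sour",
]
top_by_t = [
    "'s", "for", "and", "good", "bad", "salt", "season", "first",
    "with", "pepper", "popular", "acquired", "texture", "my", "bitter", "smell",
]

# Keep the MI ordering, retaining only words that also appear in the t list
t_set = set(top_by_t)
in_both = [w for w in top_by_mi if w in t_set]

print(in_both)
```

Only four words appear in both top-16 lists, which shows how differently the two measures rank the same data.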


[This post was edited by the author on 2005-05-12 at 11:11:15]
 
#4
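Many thanks, dzhigner. The article you provided is quite authoritative. Do you happen to have any information about the paper Clear, J. (1993), "From Firth Principles: Computational Tools for the Study of Collocation"? It is also about Mutual Information and T-Score.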
 