Collocation statistics MI, t, z...

xiaoz

Permanent Super Administrator
Staff member
MI, t, z...

MI is a statistical measure borrowed from information theory. The MI score is computed by dividing the observed frequency of a co-occurring word within a defined span around the search string (the so-called node word), for example a 4:4 window, namely four words to the left and four words to the right of the node word, by the expected frequency of the co-occurring word in that span, and then taking the logarithm to the base 2 of the result. The MI score is a measure of collocational strength: the higher the MI score, the stronger the link between two items. The closer the MI score is to 0, the more likely it is that the two items co-occur by chance. The MI score can also be negative if two items tend to shun each other. Hunston (2002: 71) proposes that an MI score of 3 or higher be taken as evidence that two items are collocates.
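For readers who like to see the arithmetic, the calculation just described can be sketched in Python. The window, frequencies and corpus size below are invented for illustration, and real tools may compute the expected frequency slightly differently:

```python
import math

def mi_score(o, f_node, f_collocate, n, span=8):
    """Mutual information: log2(observed / expected).

    o            observed co-occurrences of the pair within the span
    f_node       total corpus frequency of the node word
    f_collocate  total corpus frequency of the collocate
    n            corpus size in tokens
    span         window size in tokens (8 for a 4:4 window)
    """
    expected = f_node * f_collocate * span / n
    return math.log2(o / expected)

# Made-up pair: co-occurs 50 times in a 100m-word corpus
print(round(mi_score(o=50, f_node=4000, f_collocate=600, n=100_000_000), 2))
```

By Hunston's rule of thumb, anything above 3 would count as evidence of collocation; the invented pair here scores well above that.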

However, as Hunston (2002: 72) suggests, collocational strength is not always reliable in identifying meaningful collocations. We also need to know the amount of evidence available for a collocation. This means that the corpus size is also important in identifying how certain a collocation is. In this regard, the t test is useful as it takes corpus size into account. As such, an MI score is not as dependent upon the corpus size as a t score is. The t score can be computed by subtracting the expected frequency from the observed frequency and then dividing the result by the standard deviation. A t score of 2 or higher is normally considered to be statistically significant, though the specific probability level can be looked up in a table of distribution, using the computed t score and the number of degrees of freedom.
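A comparable sketch of the t calculation, using the common corpus-linguistics shortcut of taking the standard deviation to be the square root of the observed frequency (all figures below are invented):

```python
import math

def t_score(o, f_node, f_collocate, n, span=8):
    """t-score: (observed - expected) / sqrt(observed).

    sqrt(o) is the usual approximation of the standard deviation
    when the per-token probability of the pair is very small.
    """
    expected = f_node * f_collocate * span / n
    return (o - expected) / math.sqrt(o)

# Made-up pair: the score comes out well above the 2.0 rule of thumb
print(round(t_score(o=50, f_node=4000, f_collocate=600, n=100_000_000), 2))
```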

While the MI test measures the strength of collocations, the t test measures the confidence with which we can claim that there is some association (Church and Hanks 1990). Collocations with high MI scores tend to include low-frequency words whereas those with high t-scores tend to show high-frequency pairs. As such, Church, Hanks and Moon (1994) suggest intersecting the two measures and looking at pairs that have high scores in both measures.

The z score is the number of standard deviations from the mean frequency. The z test compares the observed frequency with the frequency expected if only chance is affecting the distribution. In terms of the procedure of computation, the z score is quite similar to the t score, whereas in terms of output, the z score is more akin to the MI score. A higher z score indicates a greater degree of collocability of an item with the node word. The z test is used less frequently than the MI test in corpus linguistics, but it is worth mentioning as it is built into widely used corpus tools such as TACT (Text Analytic Computer Tools) and SARA/Xaira.
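A simplified sketch of the z calculation, with the expected frequency standing in for the variance; tools such as SARA/Xaira use more refined variance estimates, and the frequencies below are again invented:

```python
import math

def z_score(o, f_node, f_collocate, n, span=8):
    """Simplified z-score: (observed - expected) / sqrt(expected).

    The computation parallels the t-score, but dividing by
    sqrt(expected) rather than sqrt(observed) inflates the score
    for pairs whose expected frequency is tiny.
    """
    expected = f_node * f_collocate * span / n
    return (o - expected) / math.sqrt(expected)

print(round(z_score(o=50, f_node=4000, f_collocate=600, n=100_000_000), 1))
```

The tiny expected frequency in the denominator is what makes the z score balloon for rare words, the problem discussed below.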


MORE ABOUT COLLOCATION STATISTICS: O/E, LL, MI3... through an example.

17.2.2 Collocation statistics
Having obtained the various collocation statistics using BNCWeb, it is now appropriate to discuss their characteristics. These statistical measures are commonly used in corpus linguistics (see unit 6.5).

The most basic statistic used for the calculation of collocations is raw frequency. As shown in Fig. 17.8, the word smell ranks 1st in the column ‘As collocate’. The raw frequency is 71, which means that the word sweet co-occurs with the word smell 71 times (with sweet as a pre-modifier) in the whole BNC. The word ranked 2nd is shop, which is pre-modified by sweet 50 times. For learner dictionaries, the list is quite useful because we can choose collocates which occur quite frequently and look familiar even to learners of English. Yet as you can see, when sorted by raw frequency of co-occurrence, frequent words crowd into the top of the collocate list. This raises the possibility that they may not be collocates as such; rather, they may simply be high-frequency words. Raw frequency is a poor guide to collocation. Look, for instance, at the third column ‘Total No. in the whole BNC’ for the words smell and shop. You can see immediately the difference in total frequency between the two words (2,537 occurrences for smell and 10,066 for shop). The raw frequency is not a reliable measure, as the total number of occurrences of the word shop in the whole BNC is almost four times greater than that of smell. While the raw frequency does show that sweet smell is a stronger collocation than sweet shop, we have to doubt its reliability as a measure of collocation because it indicates that the combination sweet shop (ranks 2nd) is stronger than sweet pea (ranks 3rd) (see Fig. 17.8). In the case of sweet pea, pea collocates with sweet 49 times whilst its total frequency in the whole BNC is only 612. This indicates that pea shows a very strong preference to collocate with sweet, certainly stronger than shop, which occurs in the BNC 10,066 times but collocates with sweet only 50 times (see Fig. 17.8).
In order to measure the strength of association, we need to move away from raw frequency and instead use other collocation statistics which can capture this relative strength of word combination.

One measure which takes into account the total frequencies of a node word and a collocate in relation to the size of the entire corpus is the ‘observed/expected’ score. This measure basically shows how far the observed results differ from what one would expect by chance alone. To derive a list of collocates sorted by the ‘observed/expected’ score using BNCWeb, select ‘Observed/expected’ from the pull-down menu for ‘statistics’ and press ‘Go’ in Fig. 17.8. The results should look like those given in Fig. 17.10. The list in the figure indicates that smell ranks 11th, with an observed/expected score of 298.4599, while shop ranks 42nd, with an observed/expected score of 52.7938. This rank order is hardly surprising because, as noted, the raw frequency can also give this result. However, if we consider pea and shop again, we can see immediately the advantage of the observed/expected measure over the raw frequency. The observed/expected score for pea is 868.0599 (ranks 5th; peas ranks 6th, with an observed/expected score of 853.8720) whereas the score for shop is 52.7938 (ranks 42nd). This shows clearly that the association between sweet and pea is much stronger than that between sweet and shop.
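The pea versus shop contrast can be reproduced schematically. The co-occurrence counts and totals below are those cited from Fig. 17.8, but the frequency of sweet, the corpus size and the span are assumptions, so the absolute scores will not match BNCWeb's; the relative ordering, however, does not depend on those assumptions:

```python
def observed_expected(o, f_node, f_collocate, n, span=8):
    """Ratio of observed to expected co-occurrence frequency."""
    expected = f_node * f_collocate * span / n
    return o / expected

# Co-occurrence counts and BNC totals from Fig. 17.8;
# f_node (frequency of 'sweet') and n are assumed values
oe_pea = observed_expected(o=49, f_node=3000, f_collocate=612, n=100_000_000)
oe_shop = observed_expected(o=50, f_node=3000, f_collocate=10066, n=100_000_000)
print(oe_pea > oe_shop, round(oe_pea / oe_shop, 1))
```

Whatever values are assumed for the node frequency, corpus size and span, they cancel out of the ratio: pea's O/E score is about sixteen times shop's, because pea's co-occurrences are a far larger share of its total frequency.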

A more sophisticated statistical measure than the observed/expected score provided by BNCWeb is the z-score. The z-score is a measure which adjusts for the general frequencies of the words involved in a potential collocation and shows how much more frequent the collocation of a word with the node word is than one would expect from their general frequencies (see unit 6.5). To get a list of collocates sorted by the z-score using BNCWeb, select ‘Z-score’ from the pull-down menu for ‘statistics’ and press ‘Go’ in Fig. 17.8. The results are given in Fig. 17.11. The z-score measure is widely used and built into corpus tools such as SARA and its new XML-aware variant Xaira. However, as Dunning (1993) observes, this measure assumes that data is normally distributed (see unit 6.3), an assumption which is not true in most cases of statistical text analysis unless either enormous corpora are used, or the analysis is restricted to only very common words (which are typically the ones least likely to be of interest). As a consequence, the z-score measure can substantially overestimate the significance of infrequent words (cf. Dunning 1993). As can be seen from Fig. 17.11, rare words such as nothings (with an overall frequency of 36 in the BNC, ranks 1st), afton (11, ranks 4th) and marjoram (47, ranks 8th) appear in the top 10 collocate list.

Fig. 17.10 Observed/expected values
Fig. 17.11 Z scores

Fig. 17.12 Log-likelihood scores
Fig. 17.13 MI scores

Fig. 17.14 MI3 scores
Fig. 17.15 Log-log scores

The solution Dunning proposes for this problem is the log-likelihood (LL) score (see unit 6.4). The LL measure does not assume the normal distribution of data. For text analysis and similar contexts, the use of log-likelihood scores leads to considerably improved statistical results. Using the LL test, textual analysis can be done effectively with much smaller amounts of text than is necessary for statistical measures which assume normal distributions. Furthermore, this measure allows comparisons to be made between the significance of the occurrences of both rare and common features (Dunning 1993: 67). Once again, we are fortunate in that BNCWeb provides this statistic, and hence users do not need to resort to statistics packages like SPSS to calculate the LL score. We can select ‘Log-likelihood’ from the pull-down menu for ‘statistics’ and press ‘Go’ in Fig. 17.8 to get a collocate list sorted by the log-likelihood score. The results are given in Fig. 17.12. As can be seen, the top 10 collocates based on LL scores include both frequent and infrequent words (but none of the infrequent words in the top 10 list are as rare as nothings, afton and marjoram).
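One common G² (log-likelihood) formulation for a 2-by-2 contingency table is sketched below; BNCWeb's exact implementation may differ in detail:

```python
import math

def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood (G2) for a 2x2 contingency table:
    o11 = node with collocate, o12 = node without collocate,
    o21 = collocate without node, o22 = neither.
    """
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    g2 = 0.0
    for o, r, c in ((o11, 0, 0), (o12, 0, 1), (o21, 1, 0), (o22, 1, 1)):
        expected = rows[r] * cols[c] / n
        if o > 0:                     # 0 * log(0) is taken as 0
            g2 += o * math.log(o / expected)
    return 2 * g2

# A table that exactly matches independence scores zero
print(log_likelihood(10, 90, 90, 810))  # prints 0.0
```

Unlike the z-score, G² makes no normality assumption, which is why it behaves better on the skewed frequency distributions typical of text.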
A quite different approach to measuring collocation is mutual information (MI). The MI measure is not as statistically rigorous as the log-likelihood test, but it is certainly widely used as an alternative to the LL and z-scores in corpus linguistics. Readers can refer back to unit 6.5 for a brief description of the MI statistic. To obtain a list of collocates for sweet sorted by the MI score, select ‘Mutual information’ from the pull-down menu for ‘statistics’ and press ‘Go’ in Fig. 17.8. The results are shown in Fig. 17.13. As shown in the figure, the top 4 collocates on the list (Afton, nothings, marjoram and smelling) are all rare words which occur fewer than 100 times (11, 36, 47 and 53 times respectively). Sweet Afton is a phrase from a song lyric celebrating the River Afton. Sweet nothings means ‘romantic and loving talk’. Sweet marjoram is the name of a plant. For lexicographical purposes, these are interesting and should be treated in a general-purpose dictionary. However, for pedagogical purposes, these expressions are of secondary importance compared with more basic collocations. These examples show that the MI score, like the z-score, gives too much weight to rare words.

There is a way of rebalancing the MI score to address this problem by giving more weight to frequent words and less to infrequent words. The MI3 score was developed for just this purpose. MI3 achieves this effect by ‘cubing’ observed frequencies (cf. Oakes 1998: 171-172). The cubing of the frequencies gives a much bigger boost to high frequencies than to low frequencies, thus achieving the desired effect. To obtain the collocation list sorted by the MI3 score, simply select ‘MI3’ from the pull-down menu for ‘statistics’ and press ‘Go’ in Fig. 17.8. The results are shown in Fig. 17.14. As can be seen, more frequent collocates such as peas, smell and tooth come to the top of the list when MI3 is used. This means that the cubic rebalancing pays off: these collocates are more useful for second language learners at beginning and intermediate levels.
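A sketch of the MI3 calculation with invented frequencies, showing how cubing lets a frequent pair overtake a rare-but-exclusive one:

```python
import math

def mi3_score(o, f_node, f_collocate, n, span=8):
    """MI3: log2(observed**3 / expected) -- cubing boosts frequent pairs."""
    expected = f_node * f_collocate * span / n
    return math.log2(o ** 3 / expected)

# Made-up pairs: one frequent, one rare but almost exclusive to the node
frequent = mi3_score(o=70, f_node=3000, f_collocate=2500, n=100_000_000)
rare = mi3_score(o=5, f_node=3000, f_collocate=11, n=100_000_000)
print(frequent > rare)
```

Under plain MI (without the cubing) the same rare pair would outrank the frequent one.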

The cubic approach to eliminating any bias in favour of low-frequency co-occurrences is not the only remedy to the problem, however. The log-log formula is yet another measure which reduces this undesirable effect of the MI score. The log-log test is basically an extension of the MI formula (see Oakes 1998: 234 for a description). To obtain the collocation list sorted by the log-log score, simply select ‘Log-log’ from the pull-down menu for ‘statistics’ and press ‘Go’ in Fig. 17.8. The results are given in Fig. 17.15. The list looks quite similar to the one based on MI3. Both measures aim to reduce the undesirable effect of MI and produce a collocation list that shows more high-frequency words with a high rank. If you are interested in lexically unique collocations, however, MI scores might be more useful.
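One common formulation of the log-log measure simply weights MI by the base-2 logarithm of the observed frequency, so that frequency re-enters the score; the sketch below uses that formulation (exact definitions may vary by tool), with invented frequencies:

```python
import math

def log_log(o, f_node, f_collocate, n, span=8):
    """Log-log: MI weighted by log2 of the observed frequency.

    One common formulation; check your tool's documentation for
    the exact definition it implements.
    """
    expected = f_node * f_collocate * span / n
    return math.log2(o / expected) * math.log2(o)

# Same made-up pairs as in the MI3 sketch: the frequent pair wins again
frequent = log_log(o=70, f_node=3000, f_collocate=2500, n=100_000_000)
rare = log_log(o=5, f_node=3000, f_collocate=11, n=100_000_000)
print(frequent > rare)
```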

A comparison of the various statistical measures provided by BNCWeb which we have reviewed so far shows that the raw frequency tends to overvalue frequent words whereas the observed/expected, MI and z-scores tend to put too much emphasis on infrequent words. In contrast, the log-likelihood, log-log and MI3 tests appear to provide more realistic collocation information.

[This post was edited by xujiajin on 18 August 2005 at 22:39:44]
 
In practice, we can use MI and z scores to extract collocations reliably if we also define a minimum frequency, which must be set in line with the corpus size. WordSmith and Xaira allow users to do this. The combined use of MI/z scores with a minimum frequency can avoid the disadvantage of over-emphasizing infrequent co-occurrences.
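This combination can be sketched as a simple post-filter. The co-occurrence figures for afton, smell and pea echo those cited earlier in the thread, but the frequency of sweet, the corpus size and the threshold are assumptions:

```python
import math

def mi_score(o, f_node, f_collocate, n, span=8):
    expected = f_node * f_collocate * span / n
    return math.log2(o / expected)

# (collocate, co-occurrences with 'sweet', total BNC frequency)
candidates = [("afton", 5, 11), ("smell", 71, 2537), ("pea", 49, 612)]

N, F_NODE, MIN_FREQ = 100_000_000, 3000, 10   # assumed; scale MIN_FREQ to corpus size

kept = [(w, round(mi_score(o, F_NODE, f, N), 2))
        for w, o, f in candidates if o >= MIN_FREQ]
print(kept)
```

With the threshold at 10, afton (5 co-occurrences) drops out while smell and pea survive with strong MI scores, which is exactly the intended effect.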
 
Re: Collocation statistics MI, t, z...

Quoting xiaoz (2005-8-18 22:58:49):
In practice, we can use MI and z scores to extract collocations reliably if we also define a minimum frequency, which must be set in line with the corpus size. WordSmith and Xaira allow users to do this. The combined use of MI/z scores with a minimum frequency can avoid the disadvantage of over-emphasizing infrequent co-occurrences.
Very interesting remark. For a better understanding of MI/z scores, could you give us an example or two here concerning the over-emphasized infrequent co-occurrences? Thanks.
 
Re: Collocation statistics MI, t, z...

Quoting xiaoz (2005-8-19 1:37:16):
You will see from the following comparison (based on the BNC). For an explanation of these figures, see my previous posting.

Observed/expected values
http://www.corpus4u.org/upload/forum/2005081901341433.jpeg

Z scores
http://www.corpus4u.org/upload/forum/2005081901345787.jpeg

Log-likelihood scores
http://www.corpus4u.org/upload/forum/2005081901352255.jpeg

MI scores
http://www.corpus4u.org/upload/forum/2005081901355181.jpeg

MI3 scores
http://www.corpus4u.org/upload/forum/2005081901362978.jpeg

Log-log scores
http://www.corpus4u.org/upload/forum/2005081901365630.jpeg

Very useful stuff. Thanks.

Some details need further elaboration though. E.g., in the first file there is a
column that says something like this: '# of occurrences as a collocate'.
What does that mean exactly? How do we know which of the total occurrences
count as a collocate and which don't?
 
The column "As collocate" shows the co-occurrence frequency. In the first figure above, for example, "afton" (line 1) occurs in the BNC 11 times in total, and 5 of these occurrences co-occur with "sweet".
 
The t-test seems to be used to solve two types of collocation discovery problems. On the one hand, it is used in "investigations of how pairs of words are used differently, rather than the association between two words" (Biber, 1998), and in this case the statistical approach is essentially Student's t-test. On the other hand, the t-test is used to investigate "how probable or improbable it is that a certain constellation will occur" (according to http://nlp.stanford.edu/fsnlp/promo/colloc.pdf). In this case the approach seems to be not so much a pure Student's t-test as a combination of Bernoulli trials and Student's t-test. I am quite confused about this.
 
Re: Collocation statistics MI, t, z...

2005082916454299.gif
 
Re: Collocation statistics MI, t, z...

Quoting dzhigner (2005-8-29 16:45:51):
2005082916454299.gif

Re. Log-likelihood, what formula did you use? I followed the BNCWeb method
and got something slightly lower than (but very close to) yours. Just wanted to
confirm.

Contingency Table Used (for the pair 'law~national'):

              x          -x
   y          4         295    (F(n) = 299)
  -y        238     1014989

(F(c) = 242)
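For what it is worth, plugging the table above into a standard G² formulation (which may well differ from BNCWeb's exact computation) can be done as follows:

```python
import math

# Cells: pair together, node only, collocate only, neither
table = {"o11": 4, "o12": 295, "o21": 238, "o22": 1014989}
n = sum(table.values())
row1 = table["o11"] + table["o12"]          # F(n) = 299
row2 = table["o21"] + table["o22"]
col1 = table["o11"] + table["o21"]          # F(c) = 242
col2 = table["o12"] + table["o22"]

expected = {"o11": row1 * col1 / n, "o12": row1 * col2 / n,
            "o21": row2 * col1 / n, "o22": row2 * col2 / n}

g2 = 2 * sum(o * math.log(o / expected[k]) for k, o in table.items())
print(round(g2, 2))
```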
 
Re: Collocation statistics MI, t, z...

Great discussion. Thanks, Dr. Xiao.
At a seminar at the University of Birmingham, Pernilla once demonstrated a comparison of two collocate lists: one sorted by t-score, the other by MI. After sorting, it is strange to see that the two lists are almost utterly different! Very few words in the two lists overlap. It seems that less frequent words can sometimes claim very high MI scores. Suppose a word occurs only three times in a corpus, and all three times it co-occurs with a node; then it is likely that the word is picked out as having a high MI score. In contrast, the MI results produced by WordSmith look more appealing to one's intuition.
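This effect is easy to verify with the usual MI formula; all the numbers below are invented:

```python
import math

def mi_score(o, f_node, f_collocate, n, span=8):
    expected = f_node * f_collocate * span / n
    return math.log2(o / expected)

# A word seen only 3 times, always with the node,
# versus a genuinely common, frequent pair
rare = mi_score(o=3, f_node=5000, f_collocate=3, n=1_000_000)
common = mi_score(o=200, f_node=5000, f_collocate=2000, n=1_000_000)
print(rare > common)
```

The rare word wins easily, even though the common pair co-occurs nearly seventy times as often.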
 
Thanks. As MI scores over-emphasize infrequent items whereas t scores focus on high-frequency items, it is unsurprising that collocation lists based on the two statistical measures do not have much overlap. The two may be of value for different purposes. For learners and pedagogical dictionaries, t scores may be more useful; for a general-purpose dictionary, MI or z scores may also be useful, as such a dictionary is supposed to include infrequent collocations.

Many corpus tools (e.g. WordSmith) allow users to define a minimum frequency in combination with a statistical measure such as MI or z. The default setting for minimum frequency in WordSmith is 5, I think. For a corpus of modest size, this threshold can effectively exclude infrequent items. I believe that's why the MI-based list produced by WordSmith appears more appealing to intuition.
 
Re: Collocation statistics MI, t, z...

Quoting 动态语法 (2005-9-5 12:10:25):


We might have used different contingency tables, but close is OK, isn't it?
 
Re: Collocation statistics MI, t, z...


If I cite your remarks above and those in the first post of this thread in an article, how should I write the reference? Thank you.
 
Re: Collocation statistics MI, t, z...

Is the content of this thread taken from your book, Corpus-based Language Studies: An Advanced Resource Book?
 