Re: [Help] Algorithms for computing the various kinds of keywords
Below is the material provided by WS5:
Formulae
For computing collocation strength, we can use
Mutual Information
Log to base 2 of (A divided by (B times C))
where
A = joint frequency divided by total tokens
B = frequency of word 1 divided by total tokens
C = frequency of word 2 divided by total tokens
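As a rough illustration, the Mutual Information formula above could be written in Python like this. The function name and parameter names are my own, not anything from WS5, and the sketch assumes all counts are positive.

```python
import math

def mutual_information(joint, f1, f2, total_tokens):
    """MI as quoted above: log2(A / (B * C)). Names are illustrative only."""
    a = joint / total_tokens   # A = joint frequency / total tokens
    b = f1 / total_tokens      # B = frequency of word 1 / total tokens
    c = f2 / total_tokens      # C = frequency of word 2 / total tokens
    return math.log2(a / (b * c))

# Example with made-up counts: 10 co-occurrences in a 100,000-token corpus
score = mutual_information(joint=10, f1=100, f2=50, total_tokens=100_000)
```

With these made-up counts the ratio A / (B times C) is 200, so the score is log2(200), a little under 8.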
MI3
Log to base 2 of ((J cubed) times E divided by B)
where
J = joint frequency
F1 = frequency of word 1
F2 = frequency of word 2
E = J + (total tokens-F1) + (total tokens-F2) + (total tokens-F1-F2)
B = (J + (total tokens-F1)) times (J + (total tokens-F2))
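A minimal sketch of MI3, transcribing E and B exactly as they are quoted above. I have not checked this against how WS5 computes it internally, and the function and parameter names are hypothetical.

```python
import math

def mi3(joint, f1, f2, total_tokens):
    """MI3 as quoted above: log2((J^3) * E / B). Names are illustrative only."""
    j = joint
    # E and B follow the quoted definitions term by term
    e = j + (total_tokens - f1) + (total_tokens - f2) + (total_tokens - f1 - f2)
    b = (j + (total_tokens - f1)) * (j + (total_tokens - f2))
    return math.log2((j ** 3) * e / b)

# Tiny made-up example: J=2, F1=4, F2=4, 10 tokens
# E = 2 + 6 + 6 + 2 = 16, B = 8 * 8 = 64, so the result is log2(8 * 16 / 64) = 1.0
value = mi3(joint=2, f1=4, f2=4, total_tokens=10)
```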
Z Score
(J - E) divided by the square root of (E times (1-P))
where
J = joint frequency
S = collocational span
F1 = frequency of word 1
F2 = frequency of word 2
P = F2 divided by (total tokens - F1)
E = P times F1 times S
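The Z Score definition above translates fairly directly into code. This is only a sketch with names I made up; it assumes E is greater than 0 and P is less than 1, so that the square root is defined.

```python
import math

def z_score(joint, f1, f2, total_tokens, span):
    """Z score as quoted above: (J - E) / sqrt(E * (1 - P)). Names are illustrative."""
    p = f2 / (total_tokens - f1)   # P = F2 / (total tokens - F1)
    e = p * f1 * span              # E = expected joint frequency within the span
    return (joint - e) / math.sqrt(e * (1 - p))

# Made-up counts: word 1 occurs 100 times, word 2 occurs 50 times,
# 100,000 tokens, span of 5, and 10 observed co-occurrences
z = z_score(joint=10, f1=100, f2=50, total_tokens=100_000, span=5)
```

A large positive value means the pair co-occurs more often than the span and the two frequencies alone would lead us to expect.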
Log Likelihood
based on Oakes p. 170-2.
2 times (
a Ln a + b Ln b + c Ln c + d Ln d
- (a+b) Ln (a+b)
- (a+c) Ln (a+c)
- (b+d) Ln (b+d)
- (c+d) Ln (c+d)
+ (a+b+c+d) Ln (a+b+c+d)
)
where
a = joint frequency
b = frequency of word 1 - a
c = frequency of word 2 - a
d = frequency of pairs involving neither w1 nor w2
and "Ln" means Natural Logarithm
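Here is a hedged sketch of the Log Likelihood expression above. It assumes a, b, c and d are the four cells of a 2x2 contingency table (so b and c count occurrences of each word without the other), and it treats 0 times Ln 0 as 0, which is the usual convention but is not stated in the quoted material. All names are my own.

```python
import math

def _x_ln_x(x):
    # Assumed convention: 0 * ln(0) is taken to be 0
    return x * math.log(x) if x > 0 else 0.0

def log_likelihood(a, b, c, d):
    """Log likelihood over the four cells a, b, c, d, as quoted above."""
    return 2 * (
        _x_ln_x(a) + _x_ln_x(b) + _x_ln_x(c) + _x_ln_x(d)
        - _x_ln_x(a + b)
        - _x_ln_x(a + c)
        - _x_ln_x(b + d)
        - _x_ln_x(c + d)
        + _x_ln_x(a + b + c + d)
    )

# Made-up cells: 10 joint, 90 word 1 only, 40 word 2 only, 99,860 neither
ll = log_likelihood(10, 90, 40, 99_860)
```

When the four cells are perfectly proportional (for example all equal), the expression comes out as 0, which is a handy sanity check.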
See also: this link from Lancaster University, Mutual Information
The quantities used in the formulae above:
- the joint frequency of two words: how often they co-occur, which assumes we have an idea of how far
away counts as "neighbours". (If you live in London, does a person in Liverpool count as a neighbour?
From the perspective of Tokyo, maybe they do. If not, is a person in Oxford? Heathrow?)
- the frequency of word 1 altogether in the corpus
- the frequency of word 2 altogether in the corpus
- the span or horizons we consider for being neighbours
- the total number of running words in our corpus: total tokens
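To make the "neighbours" idea concrete, here is one possible way to count a joint frequency. It assumes the horizon means a fixed number of tokens on either side of word 1; that is my assumption, not necessarily how WS5 counts, and the function name is hypothetical.

```python
def joint_frequency(tokens, word1, word2, horizon):
    """Count occurrences of word2 within `horizon` tokens either side of each word1."""
    count = 0
    for i, tok in enumerate(tokens):
        if tok != word1:
            continue
        lo = max(0, i - horizon)                  # left edge of the window
        hi = min(len(tokens), i + horizon + 1)    # right edge (exclusive)
        for j in range(lo, hi):
            if j != i and tokens[j] == word2:
                count += 1
    return count

tokens = "strong tea and strong coffee but weak tea".split()
narrow = joint_frequency(tokens, "strong", "tea", horizon=1)
wide = joint_frequency(tokens, "strong", "tea", horizon=5)
```

Widening the horizon from 1 to 5 raises the count here, which is exactly the London/Liverpool point: the joint frequency depends on how far away still counts as a neighbour.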