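Re: Question about BFSU Collocator 1.0 parameters
They are not the same.
We use the BNCweb formulas.
The main difference between the WordSmith and BNCweb collocation formulas is that the BNCweb formulas take the span into account, whereas the classic collocation formulas that WordSmith uses do not all do so.
We believe that collocation should take the span into account, rather than being an unrestricted co-occurrence relation between words, so we adopted the BNCweb formulas.
For the BNCweb Collocations formulas, see the PDF file below. Note the capital S in the formulas, which stands for the span. You can compare the subtle differences between the formulas.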
===========
The WordSmith algorithms and formulas follow.
For computing collocation strength, we can use
· the joint frequency of two words: how often they co-occur, which assumes we have an idea of how far away counts as "neighbours". (If you live in London, does a person in Liverpool count as a neighbour? From the perspective of Tokyo, maybe they do. If not, is a person in Oxford? Heathrow?)
· the frequency of word 1 altogether in the corpus
· the frequency of word 2 altogether in the corpus
· the span or horizons we consider for being neighbours
· the total number of running words in our corpus: total tokens
Mutual Information
Log to base 2 of (A divided by (B times C))
where
A = joint frequency divided by total tokens
B = frequency of word 1 divided by total tokens
C = frequency of word 2 divided by total tokens
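
As a minimal sketch, the Mutual Information description above can be written out in Python. The function and parameter names are my own and are not taken from WordSmith; the sketch assumes all counts are positive.

```python
import math


def mutual_information(joint_freq, freq1, freq2, total_tokens):
    """MI as worded above: log2(A / (B * C)).

    Names are illustrative assumptions, not WordSmith identifiers.
    """
    a = joint_freq / total_tokens  # A = joint frequency / total tokens
    b = freq1 / total_tokens       # B = frequency of word 1 / total tokens
    c = freq2 / total_tokens       # C = frequency of word 2 / total tokens
    return math.log2(a / (b * c))
```

Note that the span does not appear anywhere in this calculation, which is the difference pointed out at the top of this post.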
MI3
Log to base 2 of ((J cubed) times E divided by B)
where
J = joint frequency
F1 = frequency of word 1
F2 = frequency of word 2
E = J + (total tokens-F1) + (total tokens-F2) + (total tokens-F1-F2)
B = (J + (total tokens-F1)) times (J + (total tokens-F2))
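
The MI3 description above can be transcribed the same way. This sketch follows the quoted wording literally, reading "(J cubed) times E divided by B" as (J^3 * E) / B; it makes no claim about what WordSmith does internally, and the names are assumptions of mine.

```python
import math


def mi3_as_quoted(joint_freq, freq1, freq2, total_tokens):
    """MI3 transcribed literally from the description above.

    Names are illustrative assumptions, not WordSmith identifiers.
    """
    j = joint_freq
    n = total_tokens
    # E = J + (N - F1) + (N - F2) + (N - F1 - F2)
    e = j + (n - freq1) + (n - freq2) + (n - freq1 - freq2)
    # B = (J + (N - F1)) * (J + (N - F2))
    b = (j + (n - freq1)) * (j + (n - freq2))
    return math.log2(j ** 3 * e / b)
```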
Z Score
(J - E) divided by the square root of (E times (1-P))
where
J = joint frequency
S = collocational span
F1 = frequency of word 1
F2 = frequency of word 2
P = F2 divided by (total tokens - F1)
E = P times F1 times S
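
The Z Score is the one formula in this list that uses the span S. A minimal sketch, with function and parameter names that are my own assumptions:

```python
import math


def z_score(joint_freq, freq1, freq2, total_tokens, span):
    """Z score as worded above: (J - E) / sqrt(E * (1 - P)).

    Names are illustrative assumptions, not WordSmith identifiers.
    """
    p = freq2 / (total_tokens - freq1)  # P = F2 / (total tokens - F1)
    e = p * freq1 * span                # E = P * F1 * S
    return (joint_freq - e) / math.sqrt(e * (1 - p))
```

For example, with J = 50, F1 = 100, F2 = 90, 1000 total tokens and S = 4, we get P = 0.1 and E = 40, so the score is 10 / 6, roughly 1.67. Doubling the span doubles E, so the same joint frequency scores lower.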
Log Likelihood
based on Oakes p. 170-2.
2 times (
a Ln a + b Ln b + c Ln c + d Ln d
- (a+b) Ln (a+b)
- (a+c) Ln (a+c)
- (b+d) Ln (b+d)
- (c+d) Ln (c+d)
+ (a+b+c+d) Ln (a+b+c+d)
)
where
a = joint frequency
b = frequency of word 1
c = frequency of word 2
d = frequency of pairs involving neither w1 nor w2
and "Ln" means Natural Logarithm
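
A minimal sketch of the Log Likelihood expression above, taking a, b, c and d exactly as the list defines them. The helper `xlogx` and the convention that 0 Ln 0 counts as 0 are my own assumptions, added so that a zero cell does not raise an error; the quoted text does not say how zeros are handled.

```python
import math


def xlogx(x):
    """Return x * Ln(x), treating 0 * Ln(0) as 0 (an assumed convention)."""
    return x * math.log(x) if x > 0 else 0.0


def log_likelihood(a, b, c, d):
    """Log likelihood as worded above, term by term."""
    return 2 * (
        xlogx(a) + xlogx(b) + xlogx(c) + xlogx(d)
        - xlogx(a + b)
        - xlogx(a + c)
        - xlogx(b + d)
        - xlogx(c + d)
        + xlogx(a + b + c + d)
    )
```

When the four cells are equal, for example `log_likelihood(10, 10, 10, 10)`, the terms cancel and the result is 0 up to rounding, which is what we would expect when there is no association.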