# BFSU Collocator 1.0: help with the parameters

#### wxsong

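f(n,c) is presumably the frequency of co-occurrence with the search word; the parameters that follow it are statistics computed from that frequency.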

#### Attachments

• 61.3 KB, views: 47

#### williamJia

##### Open Corpus Project

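f(c) is the number of times the collocate occurs in the corpus
N is the total number of tokens in the corpus
f(n) is the number of times the node word occurs in the corpus
f(n,c) is the number of times the node word and the collocate co-occur in the corpus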
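
To make the four counts concrete, here is a minimal sketch of how they could be obtained from a tokenised text. It is not BFSU Collocator's actual code: the function name `collocation_counts`, the symmetric window and the default `span` are all assumptions made for illustration.

```python
from collections import Counter

def collocation_counts(tokens, node, collocate, span=5):
    """Return (N, f(n), f(c), f(n,c)) for one node/collocate pair.

    Assumption: "co-occurrence" means the collocate appears within
    `span` tokens to the left or right of an occurrence of the node.
    """
    freq = Counter(tokens)
    n_total = len(tokens)          # N: total tokens in the corpus
    f_nc = 0                       # f(n,c): co-occurrences within the span
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = max(0, i - span)
        # Window around the node, excluding the node itself
        window = tokens[left:i] + tokens[i + 1:i + 1 + span]
        f_nc += window.count(collocate)
    return n_total, freq[node], freq[collocate], f_nc

tokens = "the strong tea was strong and the tea was hot".split()
print(collocation_counts(tokens, "strong", "tea", span=2))  # (10, 2, 2, 2)
```

Note that f(n,c) depends on the chosen span, while N, f(n) and f(c) do not.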

#### wxsong

The WST 5 results are: tokens=217468, types=10537; a given node word occurs 71 times in the text and a given collocate occurs 6 times.
The Collocator 1.0 results are: N=6163, f(n)=1, f(c)=2.

#### xusun575

##### Senior member

#### Attachments

• 351 KB, views: 40

#### maggieq58

##### Corpus Life

Are the Z and MI formulas in BFSU Collocator exactly the same as the ones in WordSmith?

#### xujiajin

##### Administrator
Staff member

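The main difference between the collocation formulas in WordSmith and BNCweb is that the BNCweb formulas additionally take the span into account, whereas the classic collocation formulas used by WordSmith do not all do so.

For the BNCweb Collocations formulas, see the PDF file below. Note the capital S in the formulas, i.e. the span. You can compare the subtle differences between the formulas.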

===========

For computing collocation strength, we can use
· the joint frequency of two words: how often they co-occur, which assumes we have an idea of how far away counts as "neighbours". (If you live in London, does a person in Liverpool count as a neighbour? From the perspective of Tokyo, maybe they do. If not, is a person in Oxford? Heathrow?)
· the frequency of word 1 altogether in the corpus
· the frequency of word 2 altogether in the corpus
· the span or horizons we consider for being neighbours
· the total number of running words in our corpus: total tokens

Mutual Information
Log to base 2 of (A divided by (B times C))

where

A = joint frequency divided by total tokens
B = frequency of word 1 divided by total tokens
C = frequency of word 2 divided by total tokens
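
As a rough illustration, the MI definition above can be written out directly. The function and parameter names are my own, not anything taken from WordSmith or BFSU Collocator, and the sketch assumes the joint frequency is greater than zero.

```python
import math

def mutual_information(joint, f1, f2, total_tokens):
    """MI as described above: log2(A / (B * C)).

    joint        : joint frequency of the two words (must be > 0)
    f1, f2       : corpus frequencies of word 1 and word 2
    total_tokens : total running words in the corpus
    """
    a = joint / total_tokens   # A: relative joint frequency
    b = f1 / total_tokens      # B: relative frequency of word 1
    c = f2 / total_tokens      # C: relative frequency of word 2
    return math.log2(a / (b * c))

print(mutual_information(4, 8, 16, 1024))  # 5.0
```

The span does not appear anywhere in this version, which is the point made above about the classic formulas.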

MI3
Log to base 2 of ((J cubed) times E divided by B)
where

J = joint frequency
F1 = frequency of word 1
F2 = frequency of word 2
E = J + (total tokens-F1) + (total tokens-F2) + (total tokens-F1-F2)
B = (J + (total tokens-F1)) times (J + (total tokens-F2))
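
The MI3 description can be transcribed the same way. This sketch follows the wording above literally (E and B exactly as listed); it makes no claim about what any particular tool computes internally, and the function name is hypothetical.

```python
import math

def mi3_as_written(joint, f1, f2, total_tokens):
    """Literal transcription of the MI3 text above:
    log2((J cubed) * E / B), with E and B as defined there.
    Assumes joint > 0 so the logarithm is defined.
    """
    j = joint
    e = j + (total_tokens - f1) + (total_tokens - f2) + (total_tokens - f1 - f2)
    b = (j + (total_tokens - f1)) * (j + (total_tokens - f2))
    return math.log2((j ** 3) * e / b)

print(mi3_as_written(2, 3, 4, 10))  # 1.0
```

Cubing the joint frequency gives more weight to pairs that co-occur often, compared with plain MI.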

Z Score

(J - E) divided by the square root of (E times (1-P)) where

J = joint frequency
S = collocational span
F1 = frequency of word 1
F2 = frequency of word 2
P = F2 divided by (total tokens - F1)
E = P times F1 times S

Log Likelihood
based on Oakes p. 170-2.
2 times (
a Ln a + b Ln b + c Ln c + d Ln d
- (a+b) Ln (a+b)
- (a+c) Ln (a+c)
- (b+d) Ln (b+d)
- (c+d) Ln (c+d)
+ (a+b+c+d) Ln (a+b+c+d)
)
where
a = joint frequency
b = frequency of word 1 - a
c = frequency of word 2 - a
d = frequency of pairs involving neither w1 nor w2
and "Ln" means Natural Logarithm
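
The log-likelihood sum can be sketched as follows. The function takes the four cells of a 2x2 contingency table directly, so that a+b and a+c are the row and column totals; how b, c and d are derived from raw frequencies is left to the caller. Treating 0 Ln 0 as 0 is a common convention that I am assuming here, and the names are my own.

```python
import math

def xlnx(x):
    """x * ln(x), with the assumed convention that 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0

def log_likelihood(a, b, c, d):
    """Log-likelihood over the 2x2 table cells a, b, c, d as summed above."""
    return 2 * (
        xlnx(a) + xlnx(b) + xlnx(c) + xlnx(d)
        - xlnx(a + b)
        - xlnx(a + c)
        - xlnx(b + d)
        - xlnx(c + d)
        + xlnx(a + b + c + d)
    )

# Hypothetical table: the two words always occur together
print(log_likelihood(10, 0, 0, 10))
```

A table whose four cells are all equal shows no association at all, and the function returns 0 for it (up to floating-point rounding).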

#### Attachments

• 33.4 KB, views: 64

#### jwesther

I have a question about the values reported by BFSU Collocator 1.0.
I only need the MI and Z values from this tool in order to identify significant collocates. In my results, some of the Log-log and Log-likelihood values are 0, yet the MI and Z values meet the threshold for significant collocates. Can those words still be treated as significant collocates? What exactly are the Log-log and Log-likelihood values used for? Could someone explain? I have not found a clear explanation and do not understand what these two values mean in corpus work.
I hope someone can find the time to answer. Many thanks!