On Log-likelihood in BFSU PowerConc

dzhigner

Moderator
I have noticed that the log-likelihood statistics calculated by BFSU PowerConc are zero when coll-freq and conc-freq are equal. This is not the usual way to do it, and probably not the right one.

I have looked at some discussions and documentation on this issue. See this page: http://ucrel.lancs.ac.uk/llwizard.html . Note 2 on that page says, "when summing we can just ignore cells where x = 0."

I am working on a paper about collocation association measures and have studied how Dunning calculated LLR in the paper where he proposed this method. He does the same thing: the terms of the sum where log(0) would return an error are simply skipped. I have written a VBA function that produces exactly the same results as Dunning's LLR method.

I don't know in what programming language PowerConc is written. In my VBA code I broke apart a long formula and used error traps before every line (suppose x is an expression resulting in zero):

On Error Resume Next
temp = temp + x * Log(x)
On Error Resume Next
temp = temp + y * Log(y)
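
For readers who do not use VBA, the same idea can be sketched in Python. This is only an illustration, not PowerConc's code and not the VBA function mentioned above; the function names and the counts are made up for the example. In Python, math.log(0) raises ValueError, so the error trap becomes a try/except, and an explicit x > 0 check gives the same sum.

```python
import math

def sum_xlogx_trap(values):
    """Sum x * log(x), skipping any term where log() fails (e.g. x == 0)."""
    total = 0.0
    for x in values:
        try:
            total += x * math.log(x)
        except ValueError:
            # math.log(0) raises ValueError; skip the term, like the VBA error trap
            continue
    return total

def sum_xlogx_guard(values):
    """Same sum, using the convention 0 * log(0) = 0 explicitly."""
    return sum(x * math.log(x) for x in values if x > 0)

cells = [12, 0, 30, 958]  # hypothetical counts, one of them zero
print(sum_xlogx_trap(cells), sum_xlogx_guard(cells))
```

Both functions skip the zero cell and return the same total; the guard version is usually easier to read, since it does not rely on an exception being raised.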

By the way, according to the BNCweb manual, "There is a small error in the way BNCweb implements the log-likelihood formula: In principle, the calculation should be strictly binary and the above formula therefore does not contain the variable 'window span'."

I don't know whether the LL method is handled the same way in PowerConc. Personally, I don't think it is necessary to give up "span", but a small modification is needed.
 
Re: On Log-likelihood in BFSU PowerConc

Many thanks, Ding Laoshi, for the discussion.

Collocational strengths less than or equal to zero are displayed as 0 in BFSU PowerConc, as well as in BFSU Collocator. That is to say, a displayed 0 does not necessarily mean that coll-freq and conc-freq are equal. At the back end, the values are calculated; they are only suppressed to zero for practical reasons, as users might not be interested in words which are not (strongly) associated.
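
To make the display rule concrete, here is a small Python sketch. It is an assumption about the behaviour described in this post, not PowerConc's actual code; the function name and the scores are invented for illustration.

```python
def displayed_strength(raw_score):
    """Show non-positive association scores as 0, as described in the post."""
    return raw_score if raw_score > 0 else 0.0

raw_scores = {"word_a": 15.2, "word_b": 0.0, "word_c": -3.7}  # hypothetical values
for word, raw in raw_scores.items():
    print(word, raw, displayed_strength(raw))
```

Here word_b and word_c are both shown as 0 although their raw values differ, so the displayed 0 alone does not tell you what was calculated.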

We use the same method as BNCweb does.

You are right, it's often the case that a very small number is added to avoid returning zero or dividing by zero.

In different word association measures, only a few take 'span' into account, many others don't.
Cf.: http://www.collocations.de/AM/index.html
 
Re: On Log-likelihood in BFSU PowerConc

"Collocational strengths less than and eqaul to zero are displayed as 0 in BFSU PowerConc",这个做法,我认为没必要,搭配力强的自然被排序到前面,把0和负数都报告为0,问题倒是不大,但是等于机械的限制了置信度,采用什么样的置信度最好留给User,此外,User不是都盯着那些所谓“显著搭配”。把数据的Full picture原原本本给出来,User自己决定怎么定性或者怎么处理。举个例子吧,major problem这个组合也许算不得搭配,不过somehow interesting,语料库大,这个组合十有八九分低。不甚明白原理的,没准因为分低会下否认这个组合合理性的结论。再者,虽然我的理解有误,但是可以确定coll-freq和conc-freq相等的时候LL分值确实为0,fc1=1, fwc1=1; fc2=3,fwc2=3; fc3=7, fwc3=7,那么,这几个频度不同的共现是不是要区别对待?

"it's often the case that a very small number is added to refrain from returning zero or zero division."这个不是我的意思,我仅仅提到了LL,LL ratio、G-test都有log(0)的问题,一般的做法就是将包含log(0)的表达式赋值为0或者我那种比较轻便的办法,设置错误陷阱,在连加的过程中遇到log(0)错就跳过去。

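As for span, it is true that "only a few take 'span' into account, many others don't". Using span in this way is indeed not the same routine as 2-grams, but the principle is the same; with some changes to the model and the variable definitions, it can be made general. BNCweb uses LL and makes a special statement that it has dropped the span variable. I don't think that is necessary.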

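I am writing a paper that may or may not ever get published, heaven knows, and it happens to cover material related to this; feeling reluctant to talk about it before I get that damn paper published ... so I won't go into detail for now...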
 
Re: On Log-likelihood in BFSU PowerConc

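Ding Laoshi makes a good point. On the matter of zero and negative values, we will consider displaying the actual collocational strength values rather than simply resetting them to zero.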
 
Re: On Log-likelihood in BFSU PowerConc

You can use it, but you need to fill in the relevant frequencies yourself.

It is an online tool, not one you download.
 
Re: On Log-likelihood in BFSU PowerConc

Yes, it is online, and I entered the data myself. It worked fine yesterday, but today when I click the Calculate LL button below, all it shows is this:
[attached screenshot, image not available]
 