A Corpus Worker's Toolkit (Corpus Toolkit): 0908 update

Re: Corpus Toolkit: A Corpus Worker's Toolkit

Have you tried switching the Library window/panel on either side (left/right) of the screen?
Re: Corpus Toolkit: A Corpus Worker's Toolkit

PS: Dr Xu, your screen captures are going to be very helpful for
many users here. If some more 'in-action' types can be done in
the future, we will have a good collection of screen shots.
Re: Corpus Toolkit: A Corpus Worker's Toolkit

Quoting 动态语法's post of 2005-8-18 22:49:10:
Have you tried switching the Library window/panel on either side (left/right) of the screen?

This must be what you mean:
I can vaguely remember a line from a linguistics book to the effect that a graph is better than a thousand words sometimes.
Re: A Corpus Worker's Toolkit: Corpus Toolkit

J. Clear says that:

the null hypothesis,
(f(post) * span ) * relative_freq(the)
which is
(2579 * 8) * (1 / 20) = 20632 / 20 ≈ 1031

In calculating both MI and T-score, the span is used as a variable.
My question (and confusion) is: why choose 8 rather than some other
number? Is there an optimal span to use?
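
For anyone who wants to see how the span enters the figures, here is a minimal sketch of the arithmetic quoted above. The function names are my own, and the MI and T-score formulas are the forms commonly given in the collocation literature (log2 of observed over expected, and observed minus expected over the square root of observed); I am assuming, not asserting, that they match the exact variants used in the quoted calculation.

```python
import math


def expected_cooccurrence(node_freq, span, collocate_rel_freq):
    """Expected count of the collocate inside the span around the node,
    assuming words are distributed independently (the null hypothesis)."""
    return node_freq * span * collocate_rel_freq


def mutual_information(observed, expected):
    """MI score in its common form: log2(observed / expected)."""
    return math.log2(observed / expected)


def t_score(observed, expected):
    """T-score in its common form: (observed - expected) / sqrt(observed)."""
    return (observed - expected) / math.sqrt(observed)


# Figures from the quoted post: f(post) = 2579, span = 8,
# relative frequency of "the" = 1/20, i.e. 20632 / 20.
expected = expected_cooccurrence(2579, 8, 1 / 20)
print(round(expected, 1))
```

Because the expected value grows linearly with the span, a different span changes both scores even when the corpus stays the same, which is why the choice matters.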

This is Jem Clear's reply to the inquiry:
The decision to use 4 left and 4 right (giving a span of 8) was based
on work done at Birmingham University in the 1970s by Prof. John
Sinclair (using a rather small computer corpus of only a few hundred thousand
words!) which led him to conclude that the "influence" of a lexical
item on its surrounding words dropped quite sharply beyond 4 words in
both directions, but within the 4:4 span the level of "influence" was
not significantly different whichever position was selected. Based on
the data obtained from this preliminary corpus study, the Cobuild
project used 4:4 as a standard span for almost all its collocational

Of course, MI can be calculated for any two lexical items separated by
any number of intervening words, and Ken Church demonstrated in the
mid-1980s that statistically significant (*and* interesting!)
collocations can be calculated over a span of 100 or 200 words.

Best wishes

Jem Clear
29 School Road, Moseley, Birmingham, B13 9TF, UK
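
To make the 4:4 span in the reply above concrete, here is a rough sketch of how collocates inside such a window might be counted. The function name, the whitespace tokenisation, and the toy sentence are all made up for illustration; this is not how ACWT or the Cobuild software does it.

```python
from collections import Counter


def collocates_in_span(tokens, node, left=4, right=4):
    """Count every token that falls within `left` positions before and
    `right` positions after each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start = max(0, i - left)               # clip at the start of the text
        end = min(len(tokens), i + right + 1)  # clip at the end of the text
        for j in range(start, end):
            if j != i:                         # skip the node position itself
                counts[tokens[j]] += 1
    return counts


# Invented toy sentence, tokenised on whitespace only.
tokens = "please put the letter in the post box near the old post office".split()
counts = collocates_in_span(tokens, "post")
print(counts.most_common(3))
```

Setting `left` and `right` to other values gives the wider or narrower spans mentioned in the reply.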
Re: A Corpus Worker's Toolkit: Corpus Toolkit

Yes, I have heard Sinclair talk about this. So it is a variable that people
can (rather arbitrarily) choose.

Thanks for the info.
Re: A Corpus Worker's Toolkit: Corpus Toolkit

ACWT Updates!

-Updated August 18, 2005:

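* Added NEUCSP (the Chinese word segmenter from the Northeastern University
Natural Language Lab) and ICTCLAS (the lexical analysis system from the
Institute of Computing Technology, Chinese Academy of Sciences)
to the TxtUtils group.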

* Corrected some user guide inaccuracies.

* Added links to the relevant programs referenced in the clips.

ReadMe portions about the additions:

6) NEUCSP (the Chinese word segmenter from the Northeastern University Natural Language Lab) can be downloaded from


Install the program to directory

where neucsp.exe and all other system files should be stored.

This program provides part-of-speech (POS) tagged output for the currently
open file. (In a Windows-DOS console environment, which is not the case here,
it can also handle multiple files.)

7) ICTCLAS (the lexical analysis system from the Institute of Computing Technology, Chinese Academy of Sciences) can be downloaded from


Install the program to C:\ictclas, where ictclas.exe can be found. There should
be a subdirectory called C:\ictclas\data, where all other system files should be stored.

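If the ICTCLAS clip does not run, it may help to confirm that the layout described above is in place. The following is only a sketch: the function name is hypothetical, and it checks nothing beyond the two items the ReadMe mentions (ictclas.exe and the data subdirectory).

```python
from pathlib import Path


def missing_ictclas_items(root=r"C:\ictclas"):
    """Return the names of expected items not found under `root`.

    Only the two items named in the ReadMe are checked: the executable
    ictclas.exe and the subdirectory data.
    """
    root = Path(root)
    missing = []
    if not (root / "ictclas.exe").is_file():
        missing.append("ictclas.exe")
    if not (root / "data").is_dir():
        missing.append("data")
    return missing
```

Calling `missing_ictclas_items()` with no argument checks the default C:\ictclas location; an empty list means both items were found.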
For the latest information about ACWT, see page 1.
Re: A Corpus Worker's Toolkit: Corpus Toolkit (0819 update)

Oops, thanks for pointing it out. It's helpful to get all the feedback.

By the way, if anyone (not necessarily Dr Xu, who has done so much
already) is interested in translating into Chinese the
Readme file (essentially my 'user guide'), feel free to contact me.
It would help to make the Toolkit accessible to more users. Thanks.

7. History
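* Added NEUCSP, the Chinese word segmentation tool from the Northeastern University Natural Language Lab (China), and ICTCLAS, the lexical analysis system from the Institute of Computing Technology, Chinese Academy of Sciences, to the text-processing unit (TxtUtils group).
* Corrected a few errors and omissions in the original user guide.
* Added links to the related software used in the newly added clips.
- Began collecting "clips" in Ithaca, New York, in the fall of 1998.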

Email: ht_ling@sbcglobal.net.