A Corpus Worker's Toolkit: Corpus Toolbox, 0908 Update

Quoting xujiajin, 2005-9-10 9:04:47:
It is great to get the latest version.

Why Jack Du Bois instead of John Du Bois?

Jack is the diminutive form of John...in the same way as Bob for Robert, Cathy for Catherine, Mike for Michael, etc.
 
A Corpus Worker's Toolkit: Corpus Toolbox, 0908 Update

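Sept 08, 2005 additions:
- Added a link: Cathy Ball, Georgetown University, online Chi Square (X2) calculator;
- Revised the input method for CLEC annotation and for Du Bois's spoken-discourse transcription system;
- Added a multi-function semantic aspect analysis system (uses kwic_l.pl);
- Several minor changes to other NoteTab clips.
All changes, the required files, and the update instructions have been placed in post #88 on page 9.
Because the changes are few and the dates are close together, they are not posted separately, to avoid confusion.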

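Below is an introduction to one of the more important changes in this update: the multi-function semantic aspect analysis system (An Aspect Analyzer).
NoteTab clip enthusiasts (xiaoz, xujiajin, etc.) may want to read it closely; anyone not interested can skip it.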
 
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 Update

2005091015583036.jpg
 
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 Update

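(Note: the annotation examples in the screenshots were added arbitrarily for demonstration purposes
and may differ completely from the judgments one would make about the data in real research.)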

2005091016012110.jpg


_______________

2005091016014472.jpg



_______________

2005091016051224.jpg
 
I am sorry to trouble you all with this question: in my latest up-to-date version of ACWT there is no An Aspect Analyzer function under 03_DisTag as shown in the screenshot given by 动态语法. Am I missing any clips? Or is the function only available to super members or to members interested in the function? Secondly, frankly speaking, I haven't sorted through all my downloaded files for a couple of days. Today I sorted out one Perl .pl file named FORMAT. I don't know which directory it should go in, C:\perl\bin or C:\perl\lib, and what is it for? By the way, the file seems to have been downloaded on September 14, since the zip file I extracted it from is named something like 20050914. Thanks a lot for your help! Please extend my sincere wishes to your families on the occasion of the traditional Chinese Moon-Cake Festival! Thanks a lot!
 
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 Update

Quoting 清风出袖, 2005-9-17 18:33:33:
I am sorry to trouble you all with this question: in my latest up-to-date version of ACWT there is no An Aspect Analyzer function under 03_DisTag as shown in the screenshot given by 动态语法. Am I missing any clips? Or is the function only available to super members or to members interested in the function?

The best way to make sure that your files are up to date is to check page
9 of this thread, post #88. The files there are the current ones.

No, these files are not open to some users only. They are to be shared by everyone.

Secondly, frankly speaking, I haven't sorted through all my downloaded files for a couple of days. Today I sorted out one Perl .pl file named FORMAT. I don't know which directory it should go in, C:\perl\bin or C:\perl\lib, and what is it for? By the way, the file seems to have been downloaded on September 14, since the zip file I extracted it from is named something like 20050914.

It's not my file but it is safe for you to put it in the \bin dir since that's
where the binary file resides.

Thanks a lot for your help! Please extend my sincere wishes to your families on the occasion of the traditional Chinese Moon-Cake Festival! Thanks a lot!

Thank you for your kind message. Same to you and everyone visiting
this forum.
 
Just now I tried to figure out how to use the T-score calculation function in ACWT. I tried two methods of comparing how likely the words "strong" and "thick" are to associate with the word "smell", calculating them with the "compute t-score" function and the "compute t-score by the ELE method" function. Yet the differences in the results often come from the order of the input, i.e. if I input "strong" in the box first, then "strong" is more likely to associate with the word "smell"; if I input "thick" first, then "thick" is more so! What is wrong with my calculation?
Result 1 (this is most of the result I got when I input "strong" first in the top box after clicking the module "compute t-score by ELE method"):
* 1st node word 'strong', N=7809
* 2nd node word 'thick', N=1984
* Collocate word 'smell', N=13428
* Frequency of 'strong smell'=175
* Frequency of 'thick smell'=2
* No. of words following either 'strong' or 'thick'=1841
* Corpus size = 44300000 words

Results (if I reverse the order of input, then the result is different):
* 1st node word 'thick ', N=7809
* 2nd node word 'strong', N=1984
* Collocate word 'smell', N=13428
* Frequency of 'thick smell'=175
* Frequency of 'strong smell'=2
* No. of words following either 'thick ' or 'strong'=1841
* Corpus size = 44300000 words

T-Score by the (ELE) method: t=11.94

Hints: 'thick smell' is 11.94 standard deviations more likely than 'strong smell',
or,
'strong smell' is 11.94 standard deviations less likely than 'thick smell'.

According to Church et al, the confidence threshold should be at least 2.15 instead of 1.65.

T-Score by the (ELE) method: t=11.94

Hints: 'strong smell' is 11.94 standard deviations more likely than 'thick smell',
or,
'thick smell' is 11.94 standard deviations less likely than 'strong smell'.






[This post was edited by the author on 2005-09-20 at 14:54:42]
 
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 Update

Quoting 清风出袖, 2005-9-20 14:49:15:
Just now I tried to figure out how to use the T-score calculation function in ACWT. I tried two methods of comparing how likely the words "strong" and "thick" are to associate with the word "smell", calculating them with the "compute t-score" function and the "compute t-score by the ELE method" function. Yet the differences in the results often come from the order of the input, i.e. if I input "strong" in the box first, then "strong" is more likely to associate with the word "smell"; if I input "thick" first, then "thick" is more so! What is wrong with my calculation?
Result 1 (this is most of the result I got when I input "strong" first in the top box after clicking the module "compute t-score by ELE method"):
* 1st node word 'strong', N=7809
* 2nd node word 'thick', N=1984
* Collocate word 'smell', N=13428
* Frequency of 'strong smell'=175
* Frequency of 'thick smell'=2
* No. of words following either 'strong' or 'thick'=1841
* Corpus size = 44300000 words

Results (if I reverse the order of input, then the result is different):
* 1st node word 'thick ', N=7809
* 2nd node word 'strong', N=1984
* Collocate word 'smell', N=13428
* Frequency of 'thick smell'=175
* Frequency of 'strong smell'=2
* No. of words following either 'thick ' or 'strong'=1841
* Corpus size = 44300000 words

T-Score by the (ELE) method: t=11.94

Hints: 'thick smell' is 11.94 standard deviations more likely than 'strong smell',
or,
'strong smell' is 11.94 standard deviations less likely than 'thick smell'.

According to Church et al, the confidence threshold should be at least 2.15 instead of 1.65.

T-Score by the (ELE) method: t=11.94

Hints: 'strong smell' is 11.94 standard deviations more likely than 'thick smell',
or,
'thick smell' is 11.94 standard deviations less likely than 'strong smell'.

See the highlighted items...that's the problem.
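
The highlighting has not survived in this copy of the thread, but both runs quoted above list the same counts (7809, 1984, 175, 2) under swapped word labels, which suggests the words were exchanged while the frequencies stayed in their original boxes. The sketch below illustrates that effect in Python. It assumes the commonly cited simplification of the Church et al. comparison, t ≈ (f1 - f2) / sqrt(f1 + f2), with 0.5 added to each count as the ELE adjustment. The function name and the 0.5 default are illustrative assumptions, not ACWT's actual clip code, and this simplified form gives about 12.97 for these counts rather than the tool's 11.94, since the tool evidently takes more inputs.

```python
import math


def t_score_ele(freq_w1_coll: float, freq_w2_coll: float, smoothing: float = 0.5) -> float:
    """Approximate t-score for whether a collocate prefers word 1 over word 2.

    Hypothetical helper: uses t = (a - b) / sqrt(a + b), where a and b are the
    two bigram counts after adding the ELE smoothing constant.
    """
    a = freq_w1_coll + smoothing
    b = freq_w2_coll + smoothing
    return (a - b) / math.sqrt(a + b)


# Counts as reported in the thread: 'strong smell' = 175, 'thick smell' = 2.
t_strong_first = t_score_ele(175, 2)   # word 1 = strong, word 2 = thick
t_thick_first = t_score_ele(2, 175)    # words AND counts swapped together
t_mislabeled = t_score_ele(175, 2)     # words swapped, counts left in place

print(f"strong first:           t = {t_strong_first:.2f}")
print(f"thick first (correct):  t = {t_thick_first:.2f}")
print(f"thick first (mislabel): t = {t_mislabeled:.2f}")
```

When the counts move with the words, the score only changes sign, so the conclusion stays the same: "smell" prefers "strong". When only the labels move, the score is unchanged and the tool credits whichever word sits in the first box.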
 
Smell:

strong (LL= 469.414916)
heavy (LL= 51.851068)

No instance of "thick smell" is found in the BNC.
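
The LL figures above are presumably log-likelihood (G2) collocation scores. As a rough sketch of how such a score is computed from a 2x2 contingency table, the Python function below takes the bigram count, the two word frequencies, and the corpus size. The function name is a hypothetical helper, and the example counts are the ones quoted earlier in this thread for a 44,300,000-word corpus, not the BNC counts behind the figures above, so it is not expected to reproduce 469.41.

```python
import math


def log_likelihood(o11: float, f_node: float, f_coll: float, n: float) -> float:
    """Log-likelihood ratio (G2) for a node/collocate pair.

    o11    : frequency of the pair (e.g. 'strong smell')
    f_node : frequency of the node word (e.g. 'strong')
    f_coll : frequency of the collocate (e.g. 'smell')
    n      : corpus size in words
    """
    # Observed cells of the 2x2 table.
    observed = [
        o11,                        # node and collocate together
        f_node - o11,               # node without collocate
        f_coll - o11,               # collocate without node
        n - f_node - f_coll + o11,  # neither
    ]
    # Row and column totals matching each cell, used for expected values.
    rows = [f_node, f_node, n - f_node, n - f_node]
    cols = [f_coll, n - f_coll, f_coll, n - f_coll]

    total = 0.0
    for obs, row, col in zip(observed, rows, cols):
        expected = row * col / n
        if obs > 0:  # a zero cell contributes nothing (0 * ln 0 is taken as 0)
            total += obs * math.log(obs / expected)
    return 2 * total


# Counts quoted in the t-score posts of this thread (not BNC figures).
ll_strong = log_likelihood(175, 7809, 13428, 44_300_000)
ll_thick = log_likelihood(2, 1984, 13428, 44_300_000)
print(f"strong + smell: LL = {ll_strong:.2f}")
print(f"thick + smell:  LL = {ll_thick:.2f}")
```

A pair that never occurs, such as "thick smell" in the BNC, can still be passed in with a count of 0; the zero cell is skipped and the remaining cells are summed as usual.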
 
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 Update

Looks like he was doing some experiments with the T-score calculator.
 
Thanks a lot, Dr. Xiao and Mr. 动态语法! Thanks a lot for providing us with this great stuff, and I hope to see more scripts available soon! Have a nice day, both of you!
 
I found that ACWT worked more slowly, or even froze, as I tried to process larger files. How can I fix this problem, or can I boldly say that it is the Achilles' heel of ACWT? Forgive me for saying so if my wild guess is wrong!
 
Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 Update

If you want a faster processing speed, you need to upgrade to
NoteTab Standard or Professional, which are not free. So it's not
ACWT's problem as far as speed is concerned.
 