A Corpus Worker's Toolkit (语料库工具箱) - 0908 update

The translation is not finished yet; I am posting it here for now.
Last updated: August 18, 2005

A Corpus Worker's Toolkit (语料库工具箱)

1. What is ACWT?
A Corpus Worker's Toolkit (ACWT) is a set of clips embedded in the NoteTab text processor, together with Perl scripts and other utilities for handling Chinese and English texts. These tools take care of some of the cheap and dirty corpus and discourse processing and analysis work that would otherwise require sophisticated and expensive commercial software.

Major tools included in the Toolkit so far:
Text Utilities 文本处理
Merge Files 文件合并
HTML<-->Text Conversion HTML-TXT 格式文件相互转换
Tagged Text --> Plain Text Conversion 去除标注文本中的标记
File comparison/sizes/counts 文本比较/文件大小/字数统计
Chinese Word Segmentation and POS Tagging 汉字文本分词处理及词性标注
Search & Analysis 检索统计
Basic Chinese Concordance 简单汉语检索
Basic English Concordance 简单英语检索
Word List/Frequency 词表/词频表
Mutual Info/T-Score 互现信息/T值 (see the worked sketch after this list)
Normed Freq/Ratio/Lexical Density 常态化频率/型次比/词汇密度
Interactive Text Tagging 互动加码
L2 Errors - The CLEC Tags 二语学习者错误代码―CLEC赋码集
Discourse Structure - Samples 话语结构标注―样例
Semantics & Pragmatics - Samples 语义语用标注―样例
Sociolinguistics - Samples 社会语言学标注―样例
Syntax - Samples 句法标注―样例
Discourse Transcription 口语转写
The DuBois et al. System - modified 修订版DuBois et al转写体系
Header Info 头文件信息
Voice Quality 音质
Turn Taking 话轮转换
Conversation Structure 会话结构
Metalinguistic 元语言特征
Gesture 肢体语言特征
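
The Mutual Information and t-score clips are only named in the list above, so here is a minimal sketch of the standard collocation formulas they are based on (my own illustration; the actual clips may differ in detail). f(x) and f(y) are the node and collocate frequencies, f(x,y) their co-occurrence frequency, and N the corpus size.

  # Standard MI and t-score for a word pair (illustrative sketch, not the ACWT clip itself).
  use strict;
  sub mi_and_t {
      my ($fx, $fy, $fxy, $n) = @_;              # node freq, collocate freq, co-occurrences, corpus size
      my $expected = $fx * $fy / $n;             # expected co-occurrences under independence
      my $mi = log($fxy / $expected) / log(2);   # MI = log2(observed / expected)
      my $t  = ($fxy - $expected) / sqrt($fxy);  # t  = (observed - expected) / sqrt(observed)
      return ($mi, $t);
  }
  my ($mi, $t) = mi_and_t(500, 800, 40, 1_000_000);   # made-up counts, for illustration only
  printf "MI = %.2f   t = %.2f\n", $mi, $t;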

2. Installation
Running these clips requires NoteTab version 4.5 or above, the Perl interpreter, and the companion utilities described below; without them some of the ACWT tools will not work. All of these programs and components can be downloaded from the internet free of charge.
NoteTab files:

1) Download NoteTab from http://www.notetab.com. There are at least three versions: Light, Standard, and Professional. The Light version is free and works with these clips. (NoteTab Light is assumed in the discussion below.)

2) Install NoteTab Light on your Windows system.

3) If you follow the default installation procedure, there should be a directory called 'Libraries' under ...\NoteTab Light\. Use Windows Explorer to locate it (the default path should be C:\Program Files\NoteTab Light\Libraries\).

4) Copy the six clip files distributed with this readme (!TK_Start.clb, 01_TextUtl.clb, 02_WdL_Conc.clb, 03_DiscTag.clb, 04_*.Trans.clb, and 05_Links.clb) into the ...NoteTab Light\Libraries directory. When updated versions become available, put them in the same folder and overwrite the old ones.


Perl files:
3) Download ActivePerl from http://www.activestate.com/Products/languages.plex?tn=1 (or from another site) and install it. Make sure the files are placed under C:\Perl\ and its subdirectories. After installation there should be several subdirectories such as ...\bin, ...\lib, and ...\docs under C:\Perl.

4) Copy kwic.pl, kwic_e.pl, segment.pl, and wordlist.txt to C:\Perl\bin.

5) Copy segmenter.pl to C:\Perl\lib.

You must store the Perl files exactly as instructed, or some of the Perl-based tools will not work; the sketch below shows why the paths matter.
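
As a small illustration of why segmenter.pl goes into C:\Perl\lib while the other scripts go into C:\Perl\bin (assuming a default ActivePerl installation): Perl's require statement looks for files in the directories listed in @INC, and C:\Perl\lib is on that list, so a script run from ...\bin can load segmenter.pl by name.

  # Sketch only: show the module search path and load the segmenter library from it.
  use strict;
  print "$_\n" for @INC;         # C:/Perl/lib should appear here on a default install
  require 'segmenter.pl';        # found only if segmenter.pl sits in one of the @INC directories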

Companion utilities: NEUCSP (the Chinese segmenter from the Natural Language Processing Lab, Northeastern University) & ICTCLAS (the lexical analyzer from the Institute of Computing Technology, Chinese Academy of Sciences)
6) NEUCSP can be downloaded from http://www.nlplab.cn/cipsdk.html. Install it to the root of drive C, i.e. C:\neucsp; neucsp.exe and all the other system files should be stored in that directory. The program produces part-of-speech (POS) tagged output for the currently open file. (In a Windows/DOS console environment, which is not how it is used here, NEUCSP can also segment and tag multiple files.)

7) ICTCLAS can be downloaded from http://www.nlp.org.cn/categories/default.php?cat_id=12. Install the program to C:\ictclas, where ictclas.exe can be found. There should be a subdirectory called C:\ictclas\data, where all the other system files should be stored.
3. Using the Programs
Run NoteTab Light as a text processor.
By default you should see, on the left-hand side of the NoteTab Light screen, an open window listing different clip libraries (a 'library' is a collection of clips; a clip is a single tool, and one library file may contain several clips). Select !TK_Start (normally the top one). !TK_Start provides a portal to all the tool groups included in this package.
Switch to any of the tool groups that you see on !TK_Start.
Open a text file (or, better, create a scratch file first for testing), optionally select a portion of the text, and apply a tool to the file or to the selected text by clicking on an item.
For the most part these tools are designed to work on the currently open document; others can deal with one or more files on disk.

4. Acknowledgements 致谢
These bits of software are written either by me or by other internet users. Major credits go to:
- Jody Adair, Fookes Software, for various utility clips.
- Alan Cumming, for the modified kwic & segment Perl scripts as well as their interface clips. I am also grateful to him for answering many of my (amateurish) programming questions.
- Erik Peterson, for the Perl segmentation scripts.

- Ding Zheng ('dzhigner' on Corpus4U.com), for the Perl concordancer.
- NEUCSP and ICTCLAS are copyrighted products of their respective authors.
5. Disclaimer 免责声明
You are authorized to use these programs for non-commercial purposes. Feel free to modify the clips, which are plain text files located under ..\NoteTab Light\Libraries\, to suit your own research needs. It's always a good idea to make backup copies of these files before making any changes.
These programs are provided "as is". None of the authors involved shall be held responsible for any damage resulting from the use of any of the tools in this collection. Use at your own risk!
6. Support
Any questions should be directed to the online discussion forum at Corpus4u (http://www.corpus4u.com). I may be able to answer some questions on the forum from time to time, but I may not be able to provide any support at all, as I am not a full-time professional programmer.
These programs have been tested on the English Windows XP Home Edition, Service Pack 2. Although I hope they will also work on other systems, I have not done any testing there and therefore cannot guarantee success.
For an English Windows XP system to work properly with Chinese texts, support for Simplified Chinese must be enabled, and Chinese (PRC) should be set as the default language for non-Unicode programs. This is done through Control Panel > Regional and Language Options > Languages: check "Install files for complex script..." and "Install files for East Asian languages"; then under Advanced, select Chinese (PRC) "to match the language version of the non-Unicode programs you want to use".

You are welcome to contribute more tools and templates to this open Toolkit.

7. History
- Updated August 18, 2005:
* Added NEUCSP (the Northeastern University Chinese segmenter) and ICTCLAS (the CAS Institute of Computing Technology lexical analyzer) to the TxtUtils group.
* Corrected a few errors and omissions in the user guide.
* Added links to the programs used by the new clips.
- First Toolkit release: August 15, 2005.
- First clip collection: Fall 1998, Ithaca, New York.

Hongyin Tao (陶红印)
Email: ht_ling@sbcglobal.net
This readme was last updated on August 18, 2005.

http://www.corpus4u.org/upload/forum/2005082123130198.rtf
 
Re: A Corpus Worker's Toolkit (语料库工具箱)

I'm most grateful to Dr Xu!

By the way, I'm still working on improving ACWT. I'm trying to
add some more features and optimize the existing ones. Will
share it when it's ready.
 
Re: A Corpus Worker's Toolkit (语料库工具箱) - 0819 update

Quoting xujiajin (2005-8-21 21:23:08):
Last updated: August 18, 2005
A Corpus Worker's Toolkit
1. What is ACWT?

[... ...]

Normed Freq/Ratio/Lexical Density ?/型次比/词汇密度

Normalized Frequency or Normed Frequency -> 常态化频率 (?)
 
Re: A Corpus Worker's Toolkit (语料库工具箱) - 0819 update

Quoting hhyjulia (2005-8-23 1:15:24):
Dear all, I downloaded ActivePerl from the given link http://www.activestate.com/Products/languages.plex?tn=1 and I'm using Windows XP.
When I run it, it says "应用程序的os 或os 版本不正确" (the OS or OS version is not correct for this application). What's the matter? Tks!

Click to continue and you will get to the files for the different OSs. Find "Windows" and select
the AP package.

 
Got it. Tks! But another problem occurs. I opened a text and clicked on 'segment the open file', and there was no response. Then I clicked on 'read me first', which says something different from your instructions. See the following:

You must have a Perl (interpreter) program installed on c:\perl and have the following files
segment.pl
segmenter.pl
wordlist.txt
under c:\perl\bin\; otherwise you will need to modify the path "c:\perl\bin" in the clip setting to match the Perl program directory on your local machine.
At the moment text files to be processed also need to be placed under \perl\bin. Work is in progress to change this requirement.
First time running Perl script, you may be asked to specify the location of your Perl program; simply go to c:\perl\bin and click on perl.exe to confirm.
You can download a copy of the Perl program by following the link at the bottom of this clip in the Link section.
Last modified: August 15, 2005, Los Angeles, CA.
Hongyin Tao
 
It says the following files
segment.pl
segmenter.pl
wordlist.txt
under c:\perl\bin\; while according to you, segmenter.pl should be under c:\lib\
Right?
 
Re: A Corpus Worker's Toolkit (语料库工具箱) - 0819 update

Quoting hhyjulia (2005-8-23 3:38:02):
It says the following files
segment.pl
segmenter.pl
wordlist.txt
under c:\perl\bin\; while according to you, segmenter.pl should be under c:\lib\
Right?

You have downloaded the out-of-date clips. Try the updated ones, which are on page
5 of this thread. (Always have the latest clips and Readme.pdf if possible.)

Any other problems, feel free to ask.
 
Re: A Corpus Worker's Toolkit (语料库工具箱) - 0819 update

Just to clarify:

segmenter.pl should be under c:\perl\lib\

(Important: there is no (or should not be) such a directory called c:\lib that
has anything to do with the perl program.)

You may have to move the \perl\ directory up to the root so that you end up with
c:\perl\, if I remember correctly.
 
Sorry, I just made a mistake: it is c:\perl\lib\.
I have just checked again and found that my problem is actually the one in No. 18 put forward by xiaoz, but I can't find anything wrong with my installation procedure. The reason may be the following: I noticed that one directory is missing under my C:\Perl, namely ...\docs.

You said: "After the installation, there should be several sub-directories: ...\bin, ...\lib, ...\docs, etc. under C:\Perl."
But under my c:\perl there are only ...\bin, ...\eg, ...\html, ...\lib, and ...\site.
 
Re: A Corpus Worker's Toolkit (语料库工具箱) - 0819 update

Quoting hhyjulia (2005-8-23 4:06:56):
Sorry, I just made a mistake: it is c:\perl\lib\.
I have just checked again and found that my problem is actually the one in No. 18 put forward by xiaoz, but I can't find anything wrong with my installation procedure. The reason may be the following: I noticed that one directory is missing under my C:\Perl, namely ...\docs.

You said: "After the installation, there should be several sub-directories: ...\bin, ...\lib, ...\docs, etc. under C:\Perl."
But under my c:\perl there are only ...\bin, ...\eg, ...\html, ...\lib, and ...\site.

Actually \site, \docs, etc. are not essential. What's important is this:

put segment.pl and wordlist.txt under c:\perl\bin\, and

segmenter.pl under c:\perl\lib\.

segmenter.pl is different from segment.pl.
 
Re: A Corpus Worker's Toolkit (语料库工具箱) - 0819 update

Quoting hhyjulia (2005-8-23 5:25:19):
Many thanks. It turned out to be due to the out-of-date clips.

To prevent such things from happening again, I have updated the files on page 1
as well.
 
You are really helpful.
Another question: I just ran a concordance on an English article that contains a few Chinese glosses. When I counted the word frequencies, some of the Chinese characters came out garbled. What is the problem?
 
Re: A Corpus Worker's Toolkit (语料库工具箱) - 0819 update

Quoting hhyjulia (2005-8-23 6:12:47):
You are really helpful.
Another question: I just ran a concordance on an English article that contains a few Chinese glosses. When I counted the word frequencies, some of the Chinese characters came out garbled. What is the problem?

Generally speaking, Chinese texts (or the Chinese portions of a text) should be segmented into words (or characters) first; otherwise the output easily gets garbled. That is also why ACWT includes a segmenter and places it near the top.

Text Statistics, Read Me First says:

Chinese texts need to be segmented first. Use the Segmenter above
to process your text.
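
To see why unsegmented Chinese causes trouble, here is a minimal sketch (my own illustration, not the ACWT word-list clip itself) of the kind of whitespace-based counting a frequency tool relies on. Without segmentation, a whole Chinese clause ends up as one "token", and any byte-level cutting of double-byte GB text can split characters and produce garbled output.

  # Count whitespace-separated tokens from STDIN (illustrative sketch only).
  use strict;
  my %freq;
  while (my $line = <STDIN>) {
      $freq{$_}++ for grep { length } split /\s+/, $line;   # tokens = whitespace-separated strings
  }
  printf "%-20s %d\n", $_, $freq{$_} for sort { $freq{$b} <=> $freq{$a} } keys %freq;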
 
Re: A Corpus Worker's Toolkit (语料库工具箱) - 0819 update

Quoting hhyjulia (2005-8-23 6:12:47):

Another question: I just ran a concordance on an English article that contains a few Chinese glosses. When I counted the word frequencies, some of the Chinese characters came out garbled. What is the problem?

I had never tried Chinese-English mixed texts before your post, so I was curious
about what would happen with the segmented text and the statistics. I was happy
to find that everything worked pretty well.

Here is the mixed text I used:

The mixed original:
Text Statistics, Read Me First says:一般来说汉字文章或汉字部分最好要分词
(或分字), Use the Segmenter above 不然就容易出现乱码。这也是为什么
ACWT有分词器并且把它放在靠前的原因。Chinese texts need to be segmented
first. Use the Segmenter above to process your text..

Here is the segmented output (by Peterson's segmenter):
Text Statistics, Read Me First says: 一般 来说 汉字 文章 或 汉字 部分 最好 要 分词
( 或 分 字 ) , Use the Segmenter above 不然 就 容易 出现 乱 码 。 这 也是
为什么 ACWT 有分 词 器 并且 把 它 放在 靠 前 的 原因 。 Chinese texts need to
be segmented first. Use the Segmenter above to process your text.

(Note: 有分 词 器 is wrong here, but ...)

Segmented output (by the ICTCLAS segmenter):
Text Statistics , Read Me First says: 一般来说 汉字 文章 或
汉字 部分 最好 要 分词 ( 或 分 字 ) , Use the Segmenter
above 不然 就 容易 出现 乱 码 。 这 也 是 为什么 ACWT 有
分词 器 并且 把 它 放 在 靠 前 的 原因 。 Chinese texts
need to be segmented first . Use the Segmenter above
to process your text .

Now the stats:

[screenshot: word-frequency statistics for the mixed text]


(Be careful about the total word counts; individual token counts are ok.)
 
I made slight changes to sentence order and wording here and there so that the Chinese sounds natural.

Feel free to modify my translation wherever you find it necessary.

Last updated: August 18, 2005

A Corpus Worker's Toolkit - Readme (bilingual version)

1. What is ACWT? 什么是“语料库工具箱”ACWT?
A Corpus Worker's Toolkit (ACWT) is a collection of clips embedded in the NoteTab text processor, Perl scripts, and other utilities for handling Chinese and English texts. They can do some cheap and dirty corpus/discourse linguistic work for those who cannot otherwise afford sophisticated yet expensive commercial software. Most of these tools function like macros in word-processing programs, but they can do much more and work in a relatively simple text-processing environment.
语料库工具箱(ACWT)是指嵌入到文字处理软件NoteTab中的一组模块(clips),Perl代码及其他一些中英文文本处理工具。这些工具可以帮助处理一些通常需要昂贵复杂的商业软件才能实现的“又脏又累”的语料库和话语分析、处理工作。

Major tools included in the Toolkit so far:
目前“工具箱”中主要包括以下组件:
Text Utilities 文本处理
Merge Files 文件合并
HTML<-->Text Conversion HTML-TXT 格式相互转换
Tagged Text --> Plain Text Conversion 去除标注文本中的标记
File comparison/sizes/counts 文本比较/文件大小/字数统计
Chinese Word Segmentation and POS Tagging 汉字文本分词处理及词性标注
Search & Analysis 检索统计
Basic Chinese Concordance 简单汉语检索
Basic English Concordance 简单英语检索
Word List/Frequency 词表/词频表
Mutual Info/T-Score 互现信息/T值
Normed Freq/Ratio/Lexical Density 常态化频率/型次比/词汇密度 (see the sketch after this list)
Interactive Text Tagging 互动加码
L2 Errors - The CLEC Tags 二语学习者错误代码―CLEC赋码集
Discourse Structure - Samples 话语结构标注―样例
Semantics & Pragmatics - Samples 语义语用标注―样例
Sociolinguistics - Samples 社会语言学标注―样例
Syntax - Samples 句法标注―样例
Discourse Transcription 口语转写
The DuBois et al. System - modified DuBois et al(修订版)转写体系
Header Info 头文件信息
Voice Quality 音质
Turn Taking 话轮转换
Conversation Structure 会话结构
Metalinguistic 元语言特征
Gesture 肢体语言特征
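
For the Normed Freq/Ratio item above, here is a minimal sketch (my own illustration, not the ACWT clip itself) of a type/token ratio and a frequency normalized per 1,000 tokens, computed over a whitespace-segmented text read from STDIN; the word 'corpus' is only a made-up example node. Lexical density would additionally need POS tags to separate content words from function words.

  # Type/token ratio and normalized frequency (illustrative sketch only).
  use strict;
  my %types;
  my ($tokens, $hits) = (0, 0);
  my $node = 'corpus';                                  # hypothetical search word
  while (my $line = <STDIN>) {
      for my $w (grep { length } split /\s+/, $line) {
          $tokens++;
          $types{lc $w}++;
          $hits++ if lc($w) eq $node;
      }
  }
  my $ttr = $tokens ? keys(%types) / $tokens : 0;
  printf "tokens=%d  types=%d  type/token ratio=%.3f\n", $tokens, scalar(keys %types), $ttr;
  printf "frequency of '%s' per 1,000 tokens = %.2f\n", $node, $tokens ? 1000 * $hits / $tokens : 0;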

2. Installation 安装
These scripts require the installation of the NoteTab program (4.5 or above), the Perl (interpreter) program, and some other companion utilities to be specified below, all of which are freely downloadable from the internet. If you do not have Perl and the other companion utilities, some of the programs will not work.
要运行这些组件需要安装4.5及以上版本的NoteTab工具,Perl(解码)程序,以及下面提及的相关工具。缺少了这些工具您可能无法正常运行ACWT中的某些组件。这些工具和组件都可以从网上免费下载。

NoteTab Files: NoteTab工具文件:
1) Download NoteTab from http://www.notetab.com. There are at least three different versions of NoteTab: Light, Standard, and Professional. The Light version is free and can be used with these clips. (For the following discussions NoteTab Light will be assumed.)
1) 先从http://www.notetab.com网站下载NoteTab。网站上至少有3种不同版本的NoteTab:简易版(Light)、标准版(Standard)和专业版(Professional)。简易版(Light version)NoteTab是免费软件,可以嵌入前述的各种工具组件。(以下讨论中默认的都是简易版NoteTab。)

2) Install NoteTab Light on to your Windows system.
2) 将简易版NoteTab安装到Windows操作系统中。

3) There should be a directory called 'Libraries' under ...\NoteTab Light\ if you follow the default installation procedures. Use Windows Explorer to locate this directory (the default path should be: C:\Program Files\NoteTab Light\Libraries\).
3) 如果您按照默认的步骤安装NoteTab Light,那么在...\NoteTab Light\路径下应该有一个目录“Libraries”。通过Windows浏览器找到该目录(默认路径应该是C:\Program Files\NoteTab Light\Libraries\)。

4) Copy the six clip files that you just got along with this file (!TK_Start.clb, 01_TextUtl.clb, 02_WdL_Conc.clb, 03_DiscTag.clb, 04_*.Trans.clb, and 05_Links.clb) to the ...NoteTab Light\Libraries directory. You can replace them with their updated versions later. You need to keep these files in the same directory.
4) 将您连同这个自述文件一起得到的6个模块文件(即!TK_Start.clb, 01_TextUtl.clb, 02_WdL_Conc.clb, 03_DiscTag.clb, 04_*.Trans.clb, and 05_Links.clb)拷贝到...NoteTab Light\Libraries目录下。将来如果有更新版本的话也请将这些文件置于同一个文件夹下并进行替换。

Perl Files:
Perl文件:
3) Download Active Perl from http://www.activestate.com/Products/languages.plex?tn=1 (or from some other Web sites) and install it. Make sure that you have the files installed in C:\Perl\ and subdirectories. After the installation, there should be several sub-directories: ...\bin, ...\lib, ...\docs, etc. under C:\Perl.
3) 从http://www.activestate.com/Products/languages.plex?tn=1下载Active Perl(也可从其他网站下载)并进行安装。请确保所有文件置于C:\Perl\目录及相应的子目录下。安装后,C:\Perl目录下应当会出现...\bin, ...\lib, ...\docs等若干个文件夹。

4) Copy the following files to C:\Perl\bin: kwic.pl, kwic_e.pl, segment.pl, wordlist.txt.
4) 将kwic.pl, kwic_e.pl, segment.pl, wordlist.txt几个文件拷贝到C:\Perl\bin目录下。

5) Copy segmenter.pl to C:\Perl\lib.
5) 将segmenter.pl拷贝到C:\Perl\lib目录下。

You must store the Perl files as instructed, or the Perl based programs may not work.
请务必按指示相应地存储Perl文件,否则一些基于Perl的程序将难以正常运行。
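
A quick sanity check before running any of the Perl-based clips (my own suggestion, not part of the original instructions) is to confirm that the interpreter really is where the clips expect it. From a command prompt:

  C:\> C:\Perl\bin\perl.exe -v

If this prints the ActivePerl version banner, the interpreter path is correct.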

Companion Utilities: NEUCSP 东北大学自然语言实验室汉语分词器 & ICTCLAS 中科院计算所词法分析系统
配套组件:东北大学自然语言实验室汉语分词器NEUCSP和中科院计算所词法分析系统ICTCLAS
6) NEUCSP (东北大学自然语言实验室汉语分词器) can be downloaded from http://www.nlplab.cn/cipsdk.html. Install the program to C:\neucsp, where neucsp.exe and all other system files should be stored. This program provides part-of-speech (POS) tagged output for the currently open file. (In a Windows/DOS console environment, which is not the case here, it can also handle multiple files.)
6) 可以从http://www.nlplab.cn/cipsdk.html下载到东北大学自然语言实验室汉语分词器NEUCSP。请将NEUCSP安装到C盘根目录,即C:\neucsp。neucsp.exe及其他系统文件都应存在这个目录下。NEUCSP可以为当前打开的文件进行POS词性标注。在DOS环境下,NEUCSP还可以对多个文件进行分词标注处理,但此处无法实现。

7) ICTCLAS (中科院计算所词法分析系统) can be downloaded from http://www.nlp.org.cn/categories/default.php?cat_id=12. Install the program to C:\ictclas, where ictclas.exe can be found. There should be a subdirectory called C:\ictclas\data, where all other system files should be stored.
7) 可以从http://www.nlp.org.cn/categories/default.php?cat_id=12下载到中科院计算所词法分析系统ICTCLAS。请将ICTCLAS安装到C盘根目录,即C:\ictclas。ictclas.exe应在这个目录下。其他系统文件应当存储在C:\ictclas\data的目录下。

3. Using the Programs 程序的使用方法
Run NoteTab Light as a text processor.
将NoteTab Light作为一个文本编辑工具打开。
By default you should see, on the left-hand side of the NoteTab Light screen, an open window listing different clip libraries (a 'library' is a collection of clips; a clip is a single tool, and one library file may contain several clips). Select !TK_Start (normally the top one). !TK_Start provides a portal to all the tool groups included in this package.
默认在NoteTab Light窗口的左边您可以看到一个打开的小窗口,上面包含不同的模块单元(clip libraries)(每个模块单元clip library包含一组模块clips。每一个模块clip就是一个语料处理工具。一个单元可以包含多个模块。)选择!TK_Start(通常在最顶端)。!TK_Start相当于提供了一个索引面板,可以帮助用户找到ACWT中所有工具组件。
Switch to any of the tool groups that you see on !TK_Start.
切换到您在!TK_Start上看到的任何一个工具组件。
Open a text file (or, better, create a scratch file first for testing), optionally select a portion of the text, and apply a tool to the file or to the selected text by clicking on an item. (A concordance sketch follows at the end of this section.)
For the most part these tools are designed to work on the currently open document; others can deal with one or more files on disk.
打开一个文本文件(或者最好先创建一个无用的文件作测试之用)。然后可以对随意选中的(部分)文本实施相应的操作(点击某个模块)。
绝大多数情况下,工具组件默认处理的是当前打开的文档。其他一些组件可以针对本地硬盘上的一个或多个文件进行操作。
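
The Basic English Concordance item listed earlier relies on kwic_e.pl. As a rough idea of what a keyword-in-context search does (a sketch of my own with made-up parameters, not the actual kwic_e.pl script), consider:

  # Minimal keyword-in-context sketch (not the actual kwic_e.pl script).
  use strict;
  my $node  = shift @ARGV or die "usage: perl kwic_sketch.pl WORD < text\n";
  my $width = 30;                                  # characters of co-text on each side
  while (my $line = <STDIN>) {
      chomp $line;
      while ($line =~ /\b(\Q$node\E)\b/gi) {
          my $hit   = $1;
          my $left  = substr($line, 0, pos($line) - length($hit));
          my $right = substr($line, pos($line));
          $left  = substr($left, length($left) - $width) if length($left) > $width;
          $right = substr($right, 0, $width)             if length($right) > $width;
          printf "%${width}s  [%s]  %s\n", $left, $hit, $right;
      }
  }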

4. Acknowledgements 致谢
These bits of software are written either by me or by other internet users. Major credits go to:
- Jody Adair, Fookes Software, for various utility clips.
- Alan Cumming, for the modified kwic & segment Perl scripts as well as their interface clips. I am also grateful to him for answering many of my (amateurish) programming questions.
- Erik Peterson, for the Perl segmentation scripts.
- Ding Zheng ('dzhigner' on Corpus4U.com), for the Perl concordancer.
- NEUCSP (东北大学自然语言实验室汉语分词器) & ICTCLAS (中科院计算所词法分析系统) are copyrighted products of their respective authors.
这些工具组件是我本人及其他一些网络用户编写的。在此对他们表示谢意,他们包括:
- Fookes Software的Jody Adair编写了若干处理模块。
- Alan Cumming修改和编写了kwic及segment 的Perl代码,以及相应的界面模块。同时还要感谢他解答我的一些(业余的)编程问题。
- Erik Peterson编写了分词工具的perl代码。
- 丁政(在Corpus4U.com上的网名是dzhigner)编写了Perl concordancer。
- 东北大学自然语言实验室汉语分词器NEUCSP以及中科院计算所词法分析系统ICTCLAS。

5. Disclaimer 免责声明
You are authorized to use these programs for non-commercial purposes. Feel free to modify the clips, which are plain text files located under ..\NoteTab Light\Libraries\, to suit your own research needs. It's always a good idea to make backup copies of these files before making any changes.
您可以将该软件作非商业用途之用,还可以根据研究需要对其中的模块进行修改。这些纯文本的模块文件都存储在..\NoteTab Light\Libraries\目录下。不过最好在修改之前对原有模块文件加以备份。
These programs are provided "as is". None of the authors involved shall be held responsible for any damage resulting from the use of any of the tools in this collection. Use at your own risk!
这些工具组件只是“权且用之”式的。软件组件的作者将不会对由这些组件的使用所造成的任何损害承担责任。使用中一切责任自负。

6. Support 软件支持
Any questions should be directed to the online discussion forum at Corpus4u (http://www.corpus4u.com). I may be able to answer some questions on the forum from time to time, but I may not be able to provide any support at all, as I am not a full-time professional programmer.
These programs have been tested on the English Windows XP Home Edition, Service Pack 2. Although I hope they will also work on other systems, I have not done any testing there and therefore cannot guarantee success.
如有问题请于在线论坛“语料库语言学在线”Corpus4u (http://www.corpus4u.com) 提出。我会不时在论坛上答复一些问题。当然,由于我并非专业编程人员,或许也无法提供任何形式的答复。这些软件组件在英文版的Windows XP Home edition Service Pack 2进行了测试。尽管我希望它也能在其他系统下正常工作,但因为未经测试,因此不能确保不出问题。

For an English Windows XP system to work properly with Chinese texts, support for Simplified Chinese must be enabled, and Chinese (PRC) should be set as the default language for non-Unicode programs. This is done through Control Panel > Regional and Language Options > Languages: check "Install files for complex script..." and "Install files for East Asian languages"; then under Advanced, select Chinese (PRC) "to match the language version of the non-Unicode programs you want to use".
I invite you to contribute to the open collection of this Toolkit by providing more tools and/or templates.
欢迎您为我们这个开放式的语料库工具箱提供更多的工具组件或模块。

7. History 软件更新历史
-Updated August 18, 2005:
* Added NEUCSP 东北大学自然语言实验室汉语分词器 & ICTCLAS 中科院计算所词法分析系统
to the TxtUtils group.
* Corrected some user guide inaccuracies.
* Added links to the relevant programs referenced in the clips.
-First Toolkit release: August 15, 2005.
-First clip collection, Fall 1998, Ithaca, New York
-1998年秋于纽约Ithaca开始收集“模块”。
-工具箱的首次发布时间为:2005年8月15日。
-2005年8月18日更新:
* 在文本处理单元(TxtUtils group)中增加了中国东北大学自然语言实验室汉语分词工具NEUCSP和中科院计算所词法分析系统ICTCLAS。
* 修正了原自述文件中的个别错漏。
* 增加了新增模块中应用到的相关软件的链接。
Hongyin Tao (陶红印)
Email: ht_ling@sbcglobal.net
This readme was last updated on August 18, 2005. (Chinese translation by Xu Jiajin, completed August 24, 2005.)
http://www.humnet.ucla.edu/alc/chinese/ACWT/ACWT.htm

Download the bilingual ACWT readme (Word/RTF file):
http://forum.corpus4u.org/upload/forum/2005082411410731.rtf
 