A Corpus Worker's Toolkit: Corpus Toolbox (updated 09-08)


Sept-08-2005: The updated files and notes are on page [9], post #88.


If you want to look at everything on one page, go to:


A Corpus Worker's Toolkit

1. What is ACWT?

A Corpus Worker's Toolkit (ACWT) is a collection of NoteTab clips and Perl scripts for Chinese and English text processing. They support quick-and-dirty corpus and discourse linguistic work for those who cannot afford sophisticated but expensive commercial software. Most of the tools function like macros in a word processor, but they can do much more, and they work in a plain text processing environment.

Major tools included in the Toolkit so far:

Text Utilities 文本处理
Merge Files
HTML<-->Text Conversion
Tagged Text --> Plain Text Conversion
File comparison/sizes/counts
Chinese Word-based Segmentation

Search & Analysis 检索统计
Basic Chinese Concordance
Basic English Concordance
Word List/Frequency
Mutual Info/T-Score
Normed Freq/Ratio/Lexical Density

Interactive Text Tagging 互动加码
L2 Errors - The CLEC Tags
Discourse Structure - Samples
Semantics & Pragmatics - Samples
Sociolinguistics - Samples
Syntax - Samples

Discourse Transcription 口语转写
The DuBois et al. System - modified
Header Info
Voice Quality
Turn Taking
Conversation Structure
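
To make the Search & Analysis items above more concrete, here is a minimal Python sketch of a word list, a keyword-in-context (KWIC) concordance, normed frequency, and the Mutual Information and T-score measures. This is not the toolkit's own code (its scripts are written in Perl); all function names are hypothetical, the tokenizer handles only simple English text, and the formulas are the commonly used textbook definitions, which may differ in detail from what the toolkit computes.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())


def word_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()


def normed_freq(count, corpus_size, base=1000):
    """Frequency per `base` words, so corpora of different sizes can be compared."""
    return count * base / corpus_size


def kwic(tokens, keyword, width=3):
    """Return one line per hit: left context, [keyword], right context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


def mutual_info(f_xy, f_x, f_y, n):
    """Mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(f_xy * n / (f_x * f_y))


def t_score(f_xy, f_x, f_y, n):
    """T-score: (observed - expected) / sqrt(observed)."""
    return (f_xy - f_x * f_y / n) / math.sqrt(f_xy)


if __name__ == "__main__":
    tokens = tokenize("The cat sat on the mat. The dog sat too.")
    print(word_list(tokens)[:2])
    for line in kwic(tokens, "sat", width=2):
        print(line)
```

Here f_xy is the number of times the two words co-occur, f_x and f_y are their individual frequencies, and n is the corpus size in tokens. Mutual information tends to favor rare pairs, while the T-score favors frequent ones, which is why the two are often reported side by side.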

2. Installation

These scripts require NoteTab (version 4.5 or above) and a Perl interpreter, both of which can be downloaded free of charge from the internet.

NoteTab Files:

1) Download NoteTab from http://www.notetab.com. There are at least three different versions of NoteTab: Light, Standard, and Professional. The Light version is free and can be used with these clips. (For the following discussions NoteTab Light will be assumed.)

2) Install NoteTab Light on your Windows system.

3) If you followed the default installation procedure, there should be a directory called 'Libraries' under ...\NoteTab Light\. Use Windows Explorer to locate this directory (the default path should be: C:\Program Files\NoteTab Light\Libraries\).

4) Copy the six clip files that came with this file (!TK_Start.clb, 01_TextUtl.clb, 02_WdL_Conc.clb, 03_DiscTag.clb, 04_*.Trans.clb, and 05_Links.clb) to the ...\NoteTab Light\Libraries directory. All six files must stay in the same directory.

Perl Files:

1) Download ActivePerl from http://www.activestate.com/Products/languages.plex?tn=1 (or from some other Web site) and install it. Make sure that the files are installed in C:\Perl\ and its subdirectories. After the installation, there should be several subdirectories under C:\Perl: ...\bin, ...\lib, ...\docs, etc.

2) Copy the following files to C:\Perl\bin: kwic.pl, kwic_e.pl, segment.pl, wordlist.txt.

3) Copy segmenter.pl to C:\Perl\lib.

You must store the Perl files as instructed, or the Perl-based programs may not work.
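
Since the tools depend on every file being in the right place, a quick check can save some troubleshooting. The following Python sketch is not part of the toolkit; the function names are hypothetical. The file names come from the installation steps above, and the two directories passed in at the bottom are the default locations mentioned there, so adjust them if you installed elsewhere.

```python
from pathlib import Path

# File names as listed in the installation steps. "04_*.Trans.clb" is kept
# as a wildcard pattern because that is how the instructions give it.
NOTETAB_CLIPS = ["!TK_Start.clb", "01_TextUtl.clb", "02_WdL_Conc.clb",
                 "03_DiscTag.clb", "04_*.Trans.clb", "05_Links.clb"]
PERL_BIN = ["kwic.pl", "kwic_e.pl", "segment.pl", "wordlist.txt"]
PERL_LIB = ["segmenter.pl"]


def missing_files(directory, patterns):
    """Return the patterns that match no file in `directory`."""
    root = Path(directory)
    return [p for p in patterns
            if not (root.is_dir() and any(root.glob(p)))]


def check_install(libraries_dir, perl_dir):
    """Map each expected location to the files that were not found there."""
    perl = Path(perl_dir)
    return {
        str(libraries_dir): missing_files(libraries_dir, NOTETAB_CLIPS),
        str(perl / "bin"): missing_files(perl / "bin", PERL_BIN),
        str(perl / "lib"): missing_files(perl / "lib", PERL_LIB),
    }


if __name__ == "__main__":
    # Default locations from the instructions above (assumed, not verified).
    report = check_install(r"C:\Program Files\NoteTab Light\Libraries",
                           r"C:\Perl")
    for location, missing in report.items():
        print(location, "OK" if not missing else f"missing: {missing}")
```

An empty list for a location means that every expected file was found there.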

3. Using the Programs

Run NoteTab Light as a text processor.

By default, an open window on the left-hand side of the NoteTab Light screen lists the available clip libraries. (A 'library' is a collection of clips, and a clip is a tool; one library file may contain several clips.) Select !TK_Start (normally the top one). !TK_Start provides a portal to all the tool groups included in this package.

Switch to any of the tool groups that you see on !TK_Start.

Open a text file (or, better, create a scratch file for testing first) and apply a tool to it by clicking on an item.

Most of these tools work on the currently open document; others can process one or more files on disk.

4. Acknowledgments

These bits of software were written either by me or by other internet users. Major credits go to:

- Jody Adair, Fookes Software, for various utilities clips.
- Alan Cumming, for the modified kwic & segment Perl scripts as well as their interface clips. I am also grateful to him for answering many of my (amateurish) programming questions.
- Erik Peterson, for the Perl segmentation scripts.
- Ding Zheng ('dzhigner' on Corpus4u), for the Perl concordancer.

5. Disclaimer

You are authorized to use these programs for non-commercial purposes. Feel free to modify the clips, which are plain text files located under ..\NoteTab Light\Libraries\, to suit your own research needs. It's always a good idea to make backup copies of these files before making any changes.

These programs are provided "As Is". None of the authors shall be held responsible for any damage or harm resulting from the use of any of the tools in this collection. Use at your own risk!

6. Support

Any questions should be directed to the on-line discussion forum at Corpus4u (http://www.corpus4u.org). I may be able to answer some questions on the forum from time to time, but I may not be able to provide any support at all, as I am not a full-time professional programmer.

These programs have been tested on the English Windows XP Home edition, Service Pack 2. I hope that they will also work on other systems, but I have not tested them there and therefore cannot guarantee any success.

For an English Windows XP system to work properly with Chinese texts, support for Simplified Chinese must be enabled, and Chinese (PRC) should be set as the default language for non-Unicode programs. To do this, open Control Panel, then Regional and Language Options. Under Languages, check "Install Files for Complex Script for..." and "Install Files for East Asian Languages"; under Advanced, select Chinese (PRC) "to match the language version of the non-Unicode programs you want to use".
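
A related practical point: with Chinese (PRC) selected for non-Unicode programs, Windows uses code page 936 (GBK) for such programs. If your Chinese text was saved in another encoding such as UTF-8, it may need converting first. The Python sketch below shows one way to do that. It assumes, without confirmation from the toolkit itself, that the tools expect GBK-encoded files; the function names are hypothetical.

```python
from pathlib import Path


def reencode(src_path, dst_path, src_encoding="utf-8", dst_encoding="gbk"):
    """Rewrite a text file in another encoding, leaving line endings untouched.

    Raises UnicodeEncodeError if the text contains characters that the
    target encoding cannot represent.
    """
    data = Path(src_path).read_bytes()
    Path(dst_path).write_bytes(data.decode(src_encoding).encode(dst_encoding))


def decodes_as(path, encoding="gbk"):
    """Return True if the whole file decodes cleanly in `encoding`."""
    try:
        Path(path).read_bytes().decode(encoding)
    except UnicodeDecodeError:
        return False
    return True
```

Working on raw bytes rather than in text mode keeps Windows line endings as they are. Note that a clean decode does not prove a file really is GBK; it only rules out files that cannot be.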

I invite you to contribute to the open collection of this Toolkit by providing more tools and/or templates.

7. History

First clip collection, Fall 1998, Ithaca, New York
Last modified: August 2005, Los Angeles, California
Hongyin Tao
Email: ht_ling@sbcglobal.net.
User Guide in English:

Re: Announcing A Corpus Worker's Toolkit

My apologies for not being able to do the Chinese version of
the user guide.
Re: Announcing A Corpus Worker's Toolkit

I suggest that you download the screen captures to see the
details of some of the files.
Should prove a very handy tool! Many thanks.

Any idea of what's happening in the following screen dump?

Download NoteTab from http://www.notetab.com
There are at least three different versions of NoteTab: Light, Standard, and Professional. The Light version is free and can be used with these clips.