A Corpus Worker's Toolkit: Corpus Toolkit (Sept. 8 update)

Posted by 动态语法 on 2005-08-17 in the "Programming and Tool Development" forum

  1. 动态语法

    动态语法 Administrator Staff Member

    Sept-08-2005: the updated files and instructions are on page [9], post #88.

    [IMG]

    If you want to look at everything on one page, go to:
    http://www.humnet.ucla.edu/alc/chinese/ACWT/ACWT.htm

    _______________________
    Dedicated to all scholars of modest means:
    A Corpus Toolkit
    Very rough; comments and corrections are welcome.
    ______________________________________

    A Corpus Worker's Toolkit


    1. What is ACWT?

    A Corpus Worker's Toolkit (ACWT) is a collection of NoteTab clips and Perl scripts for Chinese and English text processing. They can do some quick and dirty corpus/discourse linguistic work for those who cannot otherwise afford sophisticated but expensive commercial software. Most of these tools function like macros in word processing programs, but they can do much more, and they work in a simple text processing environment.

    Major tools included in the Toolkit so far:

    Text Utilities
    Merge Files
    HTML<-->Text Conversion
    Tagged Text --> Plain Text Conversion
    File comparison/sizes/counts
    Chinese Word-based Segmentation

    Search & Analysis
    Basic Chinese Concordance
    Basic English Concordance
    Word List/Frequency
    Mutual Info/T-Score
    Normed Freq/Ratio/Lexical Density

    Interactive Text Tagging
    L2 Errors - The CLEC Tags
    Discourse Structure - Samples
    Semantics & Pragmatics - Samples
    Sociolinguistics - Samples
    Syntax - Samples

    Discourse Transcription
    The DuBois et al. System - modified
    Header Info
    Voice Quality
    Turn Taking
    Conversation Structure
    Metalinguistic
    Gesture
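
    To give a sense of what the Search & Analysis tools listed above compute, here is a minimal Python sketch of a word-frequency list, a normed frequency, a basic keyword-in-context (KWIC) concordance, and mutual information and t-score for a word pair. It is not the toolkit's own code (the toolkit uses NoteTab clips and Perl scripts). The function names, the English-only tokenizing rule, the context width, and the textbook MI and t-score formulas used here are all assumptions for illustration; the toolkit may compute these differently.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (English-only rule)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def word_frequencies(tokens):
    """Return a Counter mapping each word to its raw frequency."""
    return Counter(tokens)


def normed_frequency(count, total_tokens, per=1000):
    """Scale a raw count to a rate per `per` tokens."""
    return count * per / total_tokens


def kwic(tokens, keyword, width=3):
    """Return one (left context, keyword, right context) triple per hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def mi_and_t_score(pair_count, count_a, count_b, total_tokens):
    """Textbook mutual information and t-score for a word pair.

    Assumes pair_count > 0. The expected co-occurrence count is taken as
    count_a * count_b / total_tokens (no window-size correction).
    """
    expected = count_a * count_b / total_tokens
    mi = math.log2(pair_count / expected)
    t = (pair_count - expected) / math.sqrt(pair_count)
    return mi, t


sample = "The cat sat on the mat. The dog saw the cat."
tokens = tokenize(sample)
freqs = word_frequencies(tokens)

# Print the word list, most frequent first.
for word, count in freqs.most_common():
    print(word, count, round(normed_frequency(count, len(tokens)), 1))

# Print a small concordance for "cat" with two words of context per side.
for left, key, right in kwic(tokens, "cat", width=2):
    print(f"{left:>12} [{key}] {right}")
```

    This sketch handles English only. Chinese text would first need word segmentation, which the toolkit provides as a separate tool.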


    2. Installation

    These scripts require the installation of the NoteTab program (4.5 or above) and the Perl (interpreter) program, both of which are freely downloadable from the internet.


    NoteTab Files:

    1) Download NoteTab from http://www.notetab.com. There are at least three different versions of NoteTab: Light, Standard, and Professional. The Light version is free and can be used with these clips. (For the following discussions NoteTab Light will be assumed.)

    2) Install NoteTab Light on your Windows system.

    3) There should be a directory called 'Libraries' under ...\NoteTab Light\ if you follow the default installation procedures. Use Windows Explorer to locate this directory (the default path should be: C:\Program Files\NoteTab Light\Libraries\).

    4) Copy the six clip files that came with this file (!TK_Start.clb, 01_TextUtl.clb, 02_WdL_Conc.clb, 03_DiscTag.clb, 04_*.Trans.clb, and 05_Links.clb) to the ...NoteTab Light\Libraries directory. Keep all six files in the same directory.


    Perl Files:

    1) Download ActivePerl from http://www.activestate.com/Products/languages.plex?tn=1 (or from another Web site) and install it. Make sure that the files are installed in C:\Perl\ and its subdirectories. After the installation, there should be several subdirectories under C:\Perl: ...\bin, ...\lib, ...\docs, etc.

    2) Copy the following files to C:\Perl\bin: kwic.pl, kwic_e.pl, segment.pl, wordlist.txt.

    3) Copy segmenter.pl to C:\Perl\lib.

    You must store the Perl files as instructed, or the Perl-based programs may not work.
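
    Since the Perl-based tools depend on these file locations, a quick check can save some troubleshooting. The following is a minimal Python sketch, not part of the toolkit, that reports which of the Perl files named above are missing. The function name is hypothetical, and the clip files in the NoteTab Libraries directory are not checked.

```python
from pathlib import Path

# File names taken from the installation steps above.
PERL_BIN_FILES = ["kwic.pl", "kwic_e.pl", "segment.pl", "wordlist.txt"]
PERL_LIB_FILES = ["segmenter.pl"]


def missing_perl_files(perl_root):
    """Return the expected files that are absent under perl_root.

    Paths are reported relative to perl_root, with forward slashes.
    """
    root = Path(perl_root)
    expected = [root / "bin" / name for name in PERL_BIN_FILES]
    expected += [root / "lib" / name for name in PERL_LIB_FILES]
    return [p.relative_to(root).as_posix() for p in expected if not p.is_file()]


if __name__ == "__main__":
    # C:\Perl is the install location the instructions call for.
    missing = missing_perl_files(r"C:\Perl")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All Perl files are in place.")
```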


    3. Using the Programs

    Run NoteTab Light as a text processor.

    By default, the left-hand side of the NoteTab Light window shows a panel listing the clip libraries. (A 'library' is a file containing a collection of clips, and each clip is a tool; one library may contain several clips.) Select !TK_Start (normally the top one). !TK_Start provides a portal to all the tool groups included in this package.

    Switch to any of the tool groups that you see on !TK_Start.

    Open a text file (or, better, create a scratch file first for testing) and apply a tool to it by clicking on an item.

    For the most part these tools are designed to work with the current open document. Others can deal with one or more files on disk.


    4. Acknowledgments

    These bits of software were written either by me or by other internet users. Major credits go to:

    - Jody Adair, Fookes Software, for various utilities clips.
    - Alan Cumming, for the modified kwic & segment Perl scripts as well as their interface clips. I am also grateful to him for answering many of my (amateurish) programming questions.
    - Erik Peterson, for the Perl segmentation scripts.
    - Ding Zheng ('dzhigner' on Corpus4U.com), for the Perl concordancer.


    5. Disclaimer

    You are authorized to use these programs for non-commercial purposes. Feel free to modify the clips, which are plain text files located under ..\NoteTab Light\Libraries\, to suit your own research needs. It's always a good idea to make backup copies of these files before making any changes.

    These programs are provided "As Is". None of the authors shall be held responsible for any damage or harm resulting from the use of any of the tools in this collection. Use at your own risk!


    6. Support

    Any questions should be directed to the on-line discussion forum at Corpus4u (http://www.corpus4u.org). I may be able to answer some questions on the forum from time to time, but I may not be able to provide any support at all, as I am not a full-time professional programmer.

    These programs have been tested on the English Windows XP Home edition, Service Pack 2. Although I hope that they will also work on other systems, I have not tested them there and therefore cannot guarantee that they will.

    For an English Windows XP system to work properly with Chinese texts, support for Simplified Chinese must be enabled, and Chinese (PRC) should be set as the default language for non-Unicode programs. To do this, open Control Panel, then Regional and Language Options. On the Languages tab, check "Install Files for Complex Script for..." and "Install Files for East Asian Languages". On the Advanced tab, select Chinese (PRC) "to match the language version of the non-Unicode programs you want to use".
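
    The reason this setting matters can be shown with a small Python sketch. It assumes the Chinese text is stored in a GB-family encoding (code page 936, GBK, which is what the Chinese (PRC) setting selects on Windows); the post does not state which encoding the toolkit's files actually use. The same bytes read under a Western code page come out as unreadable characters.

```python
text = "语料库"  # "corpus"

# Encode as a non-Unicode program on a Chinese (PRC) system would store it.
raw = text.encode("gbk")

# Read back with the matching code page: the text survives.
as_chinese = raw.decode("gbk")

# Read back with a Western European code page: the bytes are misinterpreted.
as_western = raw.decode("cp1252", errors="replace")

print(len(raw), "bytes")
print("Decoded as GBK:    ", as_chinese)
print("Decoded as cp1252: ", as_western)
```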

    I invite you to contribute to the open collection of this Toolkit by providing more tools and/or templates.


    7. History

    First clip collection, Fall 1998, Ithaca, New York
    Last modified: August 2005, Los Angeles, California
    Hongyin Tao
    Email: ht_ling@sbcglobal.net.
    -----------------------------------------
    User Guide in English:
    http://www.corpus4u.org/upload/forum/2005091015100987.pdf


    Chinese README file:
    http://www.corpus4u.org/upload/forum/2005091015104942.pdf
     
  2. tiger

    tiger Senior Member

    thanks a lot
     
  3. 动态语法

    动态语法 Administrator Staff Member

  4. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    [IMG]
     
  5. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    [IMG]
     
  6. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    [IMG]
     
  7. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    [IMG]
     
  8. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    [IMG]
     
  9. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    [IMG]
     
  10. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    [IMG]
     
  11. tiger

    tiger Senior Member

    but where are the files?

    i've found the files.
     
  12. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    [IMG]
     
  13. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    [IMG]
     
  14. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    [IMG]
     
  15. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    My apologies for not being able to do the Chinese version of
    the user guide.
     
  16. 动态语法

    动态语法 Administrator Staff Member

    Re: Announcing A Corpus Worker's Toolkit

    I suggest that you download the screen captures to see the
    details of some of the files.
     
  17. xusun575

    xusun575 Senior Member

    you guys are terrific! great!
     
  18. xiaoz

    xiaoz Permanent Super Administrator Staff Member

    Should prove a very handy tool! Many thanks.

    Any idea of what's happening in the following screen dump?

    [IMG]
     
  19. xujiajin

    xujiajin Administrator Staff Member

    Download NoteTab from http://www.notetab.com
    There are at least three different versions of NoteTab: Light, Standard, and Professional. The Light version is free and can be used with these clips.
    Some forum members could not find the download address, so I am posting it separately here.
     
  20. 清风出袖

    清风出袖 Senior Member

    Must the two programs be installed on the drive where the OS is? Thanks for the information!