A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

Posted by 动态语法 on 2005-08-17 in the "Programming and Tool Development" forum

  1. 动态语法

    动态语法 Administrator (Staff Member)

    Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

    Glad it works.

    A small tip for using NoteTab Light to work with ACWT:

    Delete (or compress and save) all the other clip libraries in the
    ...\Notetab Light\Libraries directory
    that come with the NoteTab Light program. Just keep the ACWT library files in it,
    i.e., just keep !TK_Start.clb, 01_TextUtl.clb, 02_WdL_Conc.clb, 03_DiscTag.clb, 04_Trans.clb, and 05_Links.clb.

    This way your ...\Libraries dir will not be cluttered and the desktop will be clean as well
    when you run NoteTab Light (and ACWT). If you really want to use the system libraries
    you can always put them back in.
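
The tidy-up described above could also be scripted. The following is only a sketch: the function name and the backup folder are hypothetical, and the Libraries path is deliberately left as a parameter because it depends on where NoteTab Light is installed. Only the six ACWT file names come from the post.

```python
import shutil
from pathlib import Path

# ACWT library names as listed in the post above.
ACWT_LIBRARIES = {
    "!TK_Start.clb", "01_TextUtl.clb", "02_WdL_Conc.clb",
    "03_DiscTag.clb", "04_Trans.clb", "05_Links.clb",
}


def set_aside_other_libraries(libraries_dir, backup_dir):
    """Move every .clb file that is not an ACWT library into backup_dir.

    Both arguments are supplied by the caller; nothing here guesses the
    NoteTab Light installation path. Returns the names of the moved files.
    """
    libraries_dir = Path(libraries_dir)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    keep = {name.lower() for name in ACWT_LIBRARIES}
    moved = []
    for path in sorted(libraries_dir.iterdir()):
        is_clip_library = path.is_file() and path.suffix.lower() == ".clb"
        if is_clip_library and path.name.lower() not in keep:
            shutil.move(str(path), str(backup_dir / path.name))
            moved.append(path.name)
    return moved
```

Because the files are moved rather than deleted, the system libraries can be put back by copying them out of the backup folder.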
     
  2. yinghuang

    yinghuang Senior Member

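    Could the experts here tell me why the word frequencies, the punctuation counts, and the Chinese character count all come out wrong when I process Chinese text? The character count in particular differs greatly from the one MS Word reports. MS Word has a problem of its own: it counts punctuation marks as characters. So my plan was to subtract the punctuation count produced by ACWT from the character count produced by MS Word, which would leave the number of Chinese characters alone. Unfortunately...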
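
As an independent cross-check on both tools, the two counts can be computed directly. This is a minimal sketch, not what ACWT or MS Word do internally. It assumes that "Chinese character" means a code point in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), so rarer extension-block characters are not counted, and that "punctuation" means any Unicode punctuation category. The function name is made up for illustration.

```python
import unicodedata


def count_characters(text):
    """Count Han characters and punctuation marks separately.

    ASSUMPTION: only the basic CJK Unified Ideographs block counts as Han.
    """
    han = 0
    punctuation = 0
    for ch in text:
        if "\u4e00" <= ch <= "\u9fff":
            han += 1
        elif unicodedata.category(ch).startswith("P"):
            # Covers both ASCII and full-width punctuation.
            punctuation += 1
    return {"han": han, "punctuation": punctuation}
```

Spaces inserted by a segmenter and Latin letters fall into neither count, so segmented and unsegmented versions of the same text should give the same result.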
     
  3. 动态语法

    动态语法 Administrator (Staff Member)

    Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

    Can you describe in more detail what your texts look like?
     
  4. yinghuang

    yinghuang Senior Member

    Thank you for your attention to my problem with ACWT. I find that ACWT has trouble identifying some, though not all, Chinese characters. To be specific, after I segment the Chinese text and run it through Text Statistics (under the Tools menu), some Chinese characters are not recognised as Chinese characters at all. My pressing question is: how can I make ACWT identify these characters, or how can I improve ACWT in some way, given that my computer skills are limited?
    Your timely response will be highly appreciated. Thank you again!
     
  5. 清风出袖

    清风出袖 Senior Member

    I found that the tag-stripping function in ACWT doesn't work well on the stu6.txt file of the CELC. I ran ACWT over it 3 or 4 times, only to have it report 'out of memory' along with a series of other errors. What's wrong with the function? Could anyone give me a hint? Thanks a lot!
     
  6. 清风出袖

    清风出袖 Senior Member

    A Screenshot to Illustrate My Point
    What's wrong with my ACWT? The same error message appears when I try to strip the tags from an English .txt file as well, i.e., st6.txt from CELC. [Image]


    [This post was edited by the author on 2005-11-01 at 01:35:40]
     
  7. 动态语法

    动态语法 Administrator (Staff Member)

    Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

    ACWT uses a regular expression to strip off the tags. To keep the error
    from appearing, you need to do one or both of the following:
    1) make your text shorter;
    2) increase your system's memory (RAM).

    If the problem persists, I suggest that you use the ICTCLAS tokenizer. It has the
    option of segmenting your text without adding POS tags to the plain text.
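
Outside NoteTab, one way to act on "make your text shorter" is to apply the regular expression one line at a time, so that only a single line is held in memory. The sketch below is not ACWT's clip. It ASSUMES a generic "word/TAG" format with a Latin-letter tag at the end of each token; the thread does not show the NEUCSP tag format, so the pattern would need adjusting. The function names and the default file encoding are also assumptions.

```python
import re

# ASSUMPTION: a tag is a slash plus Latin letters at the end of a token,
# e.g. "北京/ns". This is not taken from the NEUCSP documentation.
TAG_PATTERN = re.compile(r"/[A-Za-z]+(?=\s|$)")


def strip_tags(lines):
    """Yield each line with the assumed "/TAG" suffixes removed."""
    for line in lines:
        yield TAG_PATTERN.sub("", line)


def strip_tags_in_file(src_path, dst_path, encoding="gb18030"):
    """Stream src_path to dst_path line by line.

    ASSUMPTION: the file is in a GB-family encoding; pass another
    encoding name if that is not the case.
    """
    with open(src_path, encoding=encoding) as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        dst.writelines(strip_tags(src))
```

Streaming the file this way keeps memory use roughly proportional to the longest line rather than to the size of the whole corpus file.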
     
  8. 动态语法

    动态语法 Administrator (Staff Member)

    Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

    What's wrong with my ACWT? The same error message appears when I try to strip the tags from an English .txt file as well, i.e., st6.txt from CELC

    ------

    What do you mean by an 'English text'? Is the English text tagged by NEUCSP?
    NEUCSP is a Chinese tagger. The tag stripper is designed for this specific tagger
    because the tag format is very specific.
     
  9. 动态语法

    动态语法 Administrator (Staff Member)

    Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

    Quoting yinghuang's post of 2005-10-28 14:10:15:
    Sounds like you are using the Tools clip that comes with NoteTab Light. It is not
    part of ACWT. All ACWT clip libraries (except !TK_Start) are marked with a number:

    01_... 02_... 03_...

    You should try the stuff under 02_... or 03_...

    (Also see the tip I posted earlier for using ACWT: delete all the system clip
    libraries so that you don't confuse them with the ACWT files.)

    And it's good that you have had your text segmented first.
     
  10. 清风出袖

    清风出袖 Senior Member

    Thanks a lot! I will try to solve it as you suggested. What I mean by 'English text' is the stu subcorpus from CELC. Yesterday I tried a couple of times to strip the tags from the corpus, only to find myself down and out. The problem probably has something to do with the Light version of NoteTab, since you once said, as I recall, that ACWT couldn't process a large file with ease.
     
  11. yinghuang

    yinghuang Senior Member

    Why does ACWT run into problems when searching for words that begin with “开” or “科”? How can this be solved?
     
  12. 动态语法

    动态语法 Administrator (Staff Member)

    Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

    Please read post #132 on page 14 of this thread.
     
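  13. Help: the ICTCLAS lexical analysis system from the Institute of Computing Technology, Chinese Academy of Sciences, cannot be downloaded. Why is that?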
     
  14. Aha, I've got it.
     
  15. laohong

    laohong Administrator (Staff Member)

    It can be downloaded. Have a look here:
    http://mtgroup.ict.ac.cn/~zhp/ICTCLAS/
     
  16. Thank you! I've downloaded it.
     
  17. 清风出袖

    清风出袖 Senior Member

    I don't know why the text combination function in ACWT doesn't work when asked to process the 911 report downloaded from our site, even though I selected only .txt files in the folder. What's wrong with the files? The function has been pretty efficient at combining files, yet this time it fell short of my expectations. Sigh!
     
  18. 清风出袖

    清风出袖 Senior Member

    I see! The document combination function doesn't work well on Unicode big-endian files. It is probably the Achilles' heel of ACWT. Am I right, Mr. 动态语法?
     
  19. 动态语法

    动态语法 Administrator (Staff Member)

    Re: A Corpus Worker's Toolkit: Corpus Toolbox, 0908 update

    You are right. NoteTab Light does not work well with Unicode documents,
    and I'm not sure whether the Pro version does.
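
A possible workaround is to re-encode the big-endian Unicode files before opening them in NoteTab Light. The sketch below assumes the files are UTF-16 big-endian with a byte order mark (which is what Windows Notepad writes as "Unicode big endian") and that a GB-family encoding is an acceptable target. The thread confirms neither assumption, nor that ACWT's combine function works on the result. The function name is hypothetical.

```python
import codecs


def reencode_utf16_be(data, target_encoding="gb18030"):
    """Re-encode UTF-16 big-endian bytes (with BOM) into target_encoding.

    Bytes that do not start with the big-endian BOM are returned
    unchanged. ASSUMPTION: gb18030 is a suitable target; pass another
    encoding name if it is not.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        # The generic "utf-16" codec reads the BOM and drops it.
        text = data.decode("utf-16")
        return text.encode(target_encoding)
    return data
```

To convert a file, read it in binary mode, pass the bytes through the function, and write the result to a new file so the original is preserved.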
     
  20. laohong

    laohong Administrator (Staff Member)

    I have the NoteTab Pro version. Leave me a message if any of you would like to try it on this Unicode problem.