BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

Posted by xujiajin on 2010-04-08 in the "Programming and Tool Development" forum

  1. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    What's the flavor of the regular expressions? Could it be PCRE? If that's the case, the information at http://www.pcre.org/pcre.txt might be helpful. Thanks for any tips.
     
  2. williamJia

    williamJia Open Corpus Project

    Attachments:

  3. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Would it be possible to add one more feature to your wonderful program: letting the user adjust the font size in the GUI?

    I showed this tool to my colleagues, and they all spoke highly of it. But they also complained that it cannot be used in the classroom, because the font is too small for students to read on the projector screen.

    I guess the font size is hard-coded in the program. It can be changed in the output file, since that is plain text (or an HTML browser can do the job), but we would be very grateful if we could change it in the GUI too.
     
  4. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Ah, got it: edit the template.ini file!
    It's so flexible. Thanks for your wonderful design.
     
  5. xujiajin

    xujiajin Administrator, Staff Member

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Yes. Modify the template just as you would edit CSS for HTML.
     
  6. xujiajin

    xujiajin Administrator, Staff Member

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Different regex flavors share most of their matching patterns. I am familiar with Perl-compatible regex, which works well in Sentence Collector. Try other flavors if you want to test the tool's compatibility.
     
  7. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Thanks for your reply. As far as I know, most programs written in Delphi use a PCRE regex flavor, but I am not sure whether Sentence Collector (SC) follows suit. You are right, though: any user can simply try it out.

    I am also curious about the regex search in SC, which is fast and smooth. I am facing a paradox: for large data sets, some sort of index seems necessary for speed, yet it is also common sense that regular expressions work directly on line-based files. If the latter is true, I wonder how the system could scale up (when the data sets grow HUGE) while keeping regex search, considering that SC's default data set is less than 6M. Could you share any design decisions you made for this situation, or tell me if my observation is wrong? Thanks for any pointers.
     
  8. williamJia

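    williamJia Open Corpus Project

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    The regular expressions used in SC 1.0 are not fully PCRE (Perl Compatible Regular Expressions). SC supports the common regular expression syntax:
    Group 1: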
    1. \ Quote the next metacharacter
    2. ^ Match the beginning of the line
    3. . Match any character (except newline)
    4. $ Match the end of the line (or before newline at the end)
    5. | Alternation
    6. () Grouping
    7. [] Character class

    Group 2:
    1. * Match 0 or more times
    2. + Match 1 or more times
    3. ? Match 1 or 0 times
    4. {n} Match exactly n times
    5. {n,} Match at least n times
    6. {n,m} Match at least n but not more than m times

    Group 3:
    1. \w Match a "word" character (alphanumeric plus "_")
    2. \W Match a non-"word" character
    3. \s Match a whitespace character
    4. \S Match a non-whitespace character
    5. \d Match a digit character
    6. \D Match a non-digit character
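
    To illustrate the constructs listed above, here is a minimal sketch that uses Python's re module as a stand-in. It is not SC's own engine, and the sample sentences and the helper name matching are made up for the example.

```python
import re

# Python's re module stands in for SC's regex engine here (an assumption);
# every pattern below uses only constructs from the three groups above.
sentences = [
    "The cat sat on the mat.",
    "She has 3 cats and 12 dogs.",
    "Colour or color? Both are fine.",
]

def matching(pattern, lines):
    """Return the lines in which the pattern matches somewhere."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

by_anchor = matching(r"^The\s\w+", sentences)        # ^ anchor, \s and \w classes
by_digits = matching(r"\d{2,}", sentences)           # {n,}: at least two digits
by_variant = matching(r"colou?r", sentences)         # ?: optional "u"
by_alternation = matching(r"(cat|dog)s", sentences)  # grouping and alternation
```

    Matching is case-sensitive in this sketch, so "colou?r" finds the lowercase "color" but not the capitalized "Colour".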

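    When the data set is large, regular expression search becomes very inefficient. The usual approach is to use an index to narrow the search range first and only then apply the regular expression; this is also the idea behind SC 1.0. For a large corpus, say more than 100 MB of plain text, querying with regular expressions alone is impractical because it is far too slow. In that case we generally use search-engine technology built on an index. Lucene and Sphinx are both good choices and can easily handle searches over gigabytes of data.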
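
    The "narrow with an index first, then apply the regex" idea can be sketched roughly as follows. This is only an illustration under assumptions: a tiny in-memory inverted index keyed by lowercased words, with hypothetical helper names build_index and search. It is not SC's actual index format, nor how Lucene or Sphinx work internally.

```python
import re
from collections import defaultdict

def build_index(sentences):
    """Map each lowercased word to the set of sentence ids containing it."""
    index = defaultdict(set)
    for sid, text in enumerate(sentences):
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(sid)
    return index

def search(sentences, index, keyword, pattern):
    """Use the index to pick candidates, then regex-match only those."""
    rx = re.compile(pattern)
    candidates = sorted(index.get(keyword.lower(), ()))
    return [sentences[sid] for sid in candidates if rx.search(sentences[sid])]

corpus = [
    "He made a decision quickly.",
    "They will make a decision tomorrow.",
    "She made dinner for everyone.",
]
index = build_index(corpus)
# Only sentences containing "decision" are ever scanned by the regex.
hits = search(corpus, index, "decision", r"(made|makes?) a decision")
```

    The regex runs only over the candidate sentences returned by the index lookup, so its cost depends on how many sentences contain the keyword rather than on the size of the whole corpus.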
     
  9. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Informative and insightful, thanks a lot.
     
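  10. Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    :) Thank you, Dr. Xu, the tool works very well!
    Is there a way to search a corpus that I have built myself? How can a personal corpus be converted into the file format you designed and then used for searching?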
     
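  11. Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Hello Dr. Jia and Dr. Xu, which corpus are the example sentences extracted from? Thanks.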
     
  12. xujiajin

    xujiajin Administrator, Staff Member

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    http://www.corpus4u.org/showthread.php?t=3217
    The preloaded corpus is COLEN. The corpus has been indexed and marked up with sentence-length and new-word information.

    This was already explained in post 16.
     
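  13. Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Change one of the lines to this: div{margin-left:6px; font-size:60px; font-family:Georgia; margin-top:10pt;margin-bottom:10pt}. That enlarges the font, which makes the tool usable on a classroom screen.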
     
  14. Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Thanks to both Dr. Xu and Dr. Jia! testing...
     
  15. williamJia

    williamJia Open Corpus Project

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Doctorized!
     
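  16. Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Dr. Xu, when you have time, could you also record a tutorial, like the one for Collocator? Thanks.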
     
  17. xujiajin

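    xujiajin Administrator, Staff Member

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    I don't think that is necessary. Just type a word or a regular expression and press Enter.
    There is nothing difficult about using it.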
     
  18. Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    Downloaded it; giving it a try.
     
  19. Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    How can I use my own corpus? It cannot be changed under the settings.
     
  20. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: a corpus-based tool for extracting English example sentences

    It is possible to use your own corpus in Sentence Collector 1.0 with a bit of tinkering: sentence segmentation, sorting, new-word marking, file format conversion, index configuration, and so on. But I am not sure whether that is what the authors intended us to do with it. And if you have gone that far, you might well be tempted to write a new tool instead.
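
    For anyone curious what the first few of those steps might look like, here is a rough sketch under heavy assumptions. The naive splitter, the known-word list, and the (length, sentence, new_words) record layout are all hypothetical; SC's real file and index formats are not documented in this thread, so this does not produce files SC can load.

```python
import re

def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def prepare(text, known_words):
    """Return (length, sentence, new_words) records, shortest first."""
    records = []
    for sentence in split_sentences(text):
        words = re.findall(r"[A-Za-z']+", sentence.lower())
        # "New words" here simply means words missing from known_words.
        new_words = sorted({w for w in words if w not in known_words})
        records.append((len(words), sentence, new_words))
    return sorted(records, key=lambda r: r[0])

known = {"the", "a", "is", "on", "cat", "mat", "it"}
records = prepare("The cat is on the mat. It purrs! Is it a tabby cat?", known)
```

    A real pipeline would need a proper sentence segmenter (abbreviations such as "Dr." break this splitter) plus the format conversion and index configuration steps, which are left out here.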