BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

Posted by xujiajin on 2010-04-08 in the "Programming and Tool Development" forum

  1. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    hi all,

    I've managed to create an index file compatible with BFSU Sentence Collector 1.0.

    The original English text is taken from part of the 2009 data of the Europarl parallel corpus.

    The index file was created to facilitate better use of corpora in classroom instruction.

    To use the index file, simply unzip indexes.rar into the "indexes" folder of BFSU Sentence Collector 1.0. The original index_list.ini will be overwritten, but no other files of the program are changed.

    Known issues include some mis-representation of scripts other than English. Also, the index file has not gone through manual editing, so sentence boundaries are not necessarily 100% correct. That being said, it is still useful for pedagogic purposes. I hope you will think so too.

    Finally, I'd like to express my sincere appreciation for Dr Xu and Mr Jia's flexible design, which made this possible.

    Regards,
    iCasino

    Reference:

    Europarl: A Parallel Corpus for Statistical Machine Translation, Philipp Koehn, MT Summit 2005.
     

    Attachments:

    Last edited: 2012-04-15
  2. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    NB:

    If you want to use the COLEN corpus, please set COLEN.idx=1 in the index_list.ini file.

    If you want to turn the Europarl09 corpus off, please set Europarl09.idx=0 in the index_list.ini file.
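
The two switches above could also be flipped with a small script. This is only a sketch: it assumes that index_list.ini is a plain text file with one `name.idx=0` or `name.idx=1` entry per line, which is inferred from the two settings quoted above and not from any documentation. The helper name `set_index_flag` is made up for illustration, and the sketch works on a string so that the real file's encoding and location do not have to be guessed.

```python
def set_index_flag(ini_text, index_name, enabled):
    """Return ini_text with the entry `index_name` set to 1 or 0.

    Assumption: one `key=value` entry per line, e.g. `COLEN.idx=1`.
    Entries for other indexes are left untouched; a missing entry is appended.
    """
    value = "1" if enabled else "0"
    out, found = [], False
    for line in ini_text.splitlines():
        key, sep, _ = line.partition("=")
        if sep and key.strip() == index_name:
            out.append(f"{index_name}={value}")
            found = True
        else:
            out.append(line)
    if not found:
        out.append(f"{index_name}={value}")
    return "\n".join(out) + "\n"


# Hypothetical file content, used only to show the two changes.
sample = "COLEN.idx=0\nEuroparl09.idx=1\n"
updated = set_index_flag(sample, "COLEN.idx", True)
updated = set_index_flag(updated, "Europarl09.idx", False)
print(updated)
```

Editing the file by hand in a text editor achieves the same thing; a script only pays off when many index files are switched on and off regularly.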
     
  3. seanxpq

    seanxpq corpus explorer

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    Could you perhaps explain how the idx files are generated? ;) Thanks.
     
  4. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    I will, but only with permission from Dr. Xu and Mr. Jia. Programming is time-consuming and labor-intensive, so it would be offensive to disclose anything against their wishes.
     
  5. xujiajin

    xujiajin Administrator Staff Member

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    Hi iCasino, please feel free to post your way of creating idx files for Sentence Collector.

    We will release the fully functional version of Sentence Collector in some time. The current version has some bugs that we don't like, but we don't have time to fix them at the moment.
     
  6. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    Dr. Xu, thanks for your generosity and permission. Please understand that it is out of love of Sentence Collector that I peeped into the nooks and crannies of it.

    Great news for us all.

    -------------------------
    Next I will talk about my way of creating idx files for Sentence Collector 1.0 (hereafter SC).
     
  7. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    Data structures lie at the heart of any significant programming. For SC, those data structures are the index files.

    Unfortunately, they are encrypted. If you open COLEN.dat, you will see gibberish that looks something like Chinese. But the search results tell us that the content should be English, so there must be some conversion involved.

    I happened to read an article a long time ago. To make a long story short, it said that to make English text look like Chinese, you set the high bit of each character to 1 in its binary encoding.

    So I decoded it accordingly (shame on me)...
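
A minimal sketch of that kind of conversion, assuming it sets the most significant bit (0x80) of every byte and that decoding simply clears it again. The exact scheme used by SC is not spelled out in this thread, so the function names and the sample text are illustrative only.

```python
def encode_high_bit(data: bytes) -> bytes:
    """Set the most significant bit of every byte (assumed scheme)."""
    return bytes(b | 0x80 for b in data)


def decode_high_bit(data: bytes) -> bytes:
    """Clear the most significant bit of every byte, undoing the above."""
    return bytes(b & 0x7F for b in data)


plain = b"corpus linguistics"
scrambled = encode_high_bit(plain)

# Every scrambled byte is now 0x80 or above, outside the ASCII range.
print(scrambled.hex())
print(decode_high_bit(scrambled).decode("ascii"))
```

Under this assumption every byte ends up at 0x80 or above, which is the range Chinese double-byte encodings such as GBK use for their characters, so an editor that expects Chinese text would pair the bytes up and display Chinese-looking gibberish.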

    Then the picture became clear. The .dat file is simply a file created by BFSU NewWords Marker 1.0 with .idx as its extension, except that it has been sorted and given a new extension.

    If you look carefully, the .idx file is simply a secondary index that refers to line numbers in the .dat file (which is itself an index).
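
The idea of a secondary index over a sorted primary index can be sketched as follows. The line layout used here (a word, a tab, then a sentence) is purely hypothetical and is not the actual layout of the .dat or .idx files; only the principle of mapping a key to line numbers in a sorted file comes from the description above.

```python
# Hypothetical primary index: sorted lines of "word<TAB>sentence".
primary = sorted([
    "apple\tShe ate an apple.",
    "apple\tThe apple fell.",
    "book\tHe read a book.",
    "corpus\tThe corpus is large.",
])


def build_secondary(lines):
    """Map each key to (first_line_number, count) in the primary index."""
    secondary = {}
    for lineno, line in enumerate(lines):
        key = line.split("\t", 1)[0]
        start, count = secondary.get(key, (lineno, 0))
        secondary[key] = (start, count + 1)
    return secondary


def lookup(word, lines, secondary):
    """Return the sentences for `word` by slicing the primary index."""
    start, count = secondary.get(word, (0, 0))
    return [line.split("\t", 1)[1] for line in lines[start:start + count]]


secondary = build_secondary(primary)
print(secondary["apple"])
print(lookup("apple", primary, secondary))
```

Because the primary lines are sorted, all entries for one key are adjacent, so a start line and a count are enough to retrieve them without scanning the whole file.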

    Once you have all the files from BFSU NewWords Marker 1.0, what remains is to convert them into the formats described above and sort them. If you are lucky, you can write some tools to automate the whole process.

    As for the Europarl09 corpus, you delete all mark-up, segment it with BFSU Sentence Segmenter 1.0, and then index it with BFSU NewWords Marker 1.0. Then repeat the procedure described above.
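
The first of those steps, deleting mark-up, might look like the sketch below. It assumes the mark-up consists of angle-bracket tags such as `<CHAPTER ID=1>`, `<SPEAKER ...>` and `<P>`, mostly on lines of their own; the sample text is invented in that style and is not taken from the corpus. Segmentation and indexing are left to the BFSU tools and are not reproduced here.

```python
import re

TAG_LINE = re.compile(r"^\s*<[^<>]+>\s*$")   # a line that is nothing but one tag
INLINE_TAG = re.compile(r"<[^<>]+>")         # a tag embedded in running text


def strip_markup(text):
    """Drop tag-only lines, remove inline tags, and skip empty lines."""
    kept = []
    for line in text.splitlines():
        if TAG_LINE.match(line):
            continue
        line = INLINE_TAG.sub("", line).strip()
        if line:
            kept.append(line)
    return "\n".join(kept)


sample = '''<CHAPTER ID=1>
Resumption of the session
<SPEAKER ID=1 NAME="President">
I declare resumed the session of the European Parliament.
<P>
'''
print(strip_markup(sample))
```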

    A final note: the beauty of SC is that it does not rely on any database to drive it; instead, the programmer chose to use self-made indexes and a big array to do everything. In science, nothing is secret.
     
  8. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    Originally, I planned to use a database as the backend for a tool like SC, but I dropped the idea very soon: when most of the functionality is already there in SC, why bother reinventing the wheel? Isn't all good science built upon the shoulders of giants?

    Since Dr. Xu and Mr. Jia have released BFSU NewWords Marker 1.0 and BFSU Sentence Segmenter 1.0 into the public domain, there is good reason to believe that they want SC to be more widely used. So I am taking the risk of releasing my home-made tools for creating index files for SC before the new SC is up and running.
     

    Attachments:

  9. iCasino

    iCasino Regular Member

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    An anecdote: before I was sure that the .dat file was encrypted the way I had guessed, I did something more laborious.

    I used UltraEdit to look at the hex values in COLEN.dat and made a frequency list of all the hex values (yes, with AntConc); then I made a frequency list of all the letters in COLEN (the plain-text one). I found that the most frequent letter in the plain English COLEN is e, but the most frequent hex value in COLEN.dat is not 65 (the hex value of e). If you calculate the difference, you will be more certain about how it is encrypted.
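
The same frequency comparison can be sketched in a few lines. The scrambled sample is produced here by setting the high bit of each byte, which is an assumption made for the demonstration; with real data you would read the bytes of the plain text and of the .dat file instead. The function name `guess_offset` and the sample text are invented.

```python
from collections import Counter


def guess_offset(plain: bytes, scrambled: bytes) -> int:
    """Guess a constant byte offset by aligning the most frequent byte of each."""
    top_plain = Counter(plain).most_common(1)[0][0]
    top_scrambled = Counter(scrambled).most_common(1)[0][0]
    return (top_scrambled - top_plain) % 256


# Invented sample in which "e" is clearly the most frequent byte.
plain = b"sentence reference"
scrambled = bytes(b | 0x80 for b in plain)  # assumed scheme, for the demo only

offset = guess_offset(plain, scrambled)
print(hex(offset))

# Check the guess by undoing the offset on the whole sample.
recovered = bytes((b - offset) % 256 for b in scrambled)
print(recovered == plain)
```

Aligning only the single most frequent value is fragile on short texts; on a full corpus the top few values can be compared in the same way to confirm that the difference is constant.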

    Corpus linguistics can also be useful in cryptology!

    But I guess there is no real need to encrypt it, as COLEN is already in the public domain and not banned. I guess the encryption serves mainly as a safeguard against careless users modifying the file manually and harming the functionality of the program.

    Finally, I want to express my gratitude to Dr Xu and Mr Jia. They are always a great source of learning and a great challenge for exercising your mind.
     
  10. seanxpq

    seanxpq corpus explorer

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences


    Thanks to you all for your contributions and generosity!
     
  11. joe

    joe Junior Member

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    Very nice. Thank you all.
     
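  12. Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    I don't understand why a program would be written in a way that makes Kaspersky flag it as a trojan. Even if there really is no trojan, was it necessary to pack the executable with this technique? Kaspersky is genuinely world-class at unpacking, so is there really no trojan? Also, why not let the program work with user-defined corpora stored in the common txt format?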
     
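  13. Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    I am a secondary school teacher and have only just joined, but I like this software very much. At the moment I cannot find out how to download it. I would very much like to use it for senior high school English teaching. Please help.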
     
  14. yjlm2001

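    yjlm2001 Junior Member

    Re: BFSU Sentence Collector 1.0: A Corpus-Based Tool for Extracting English Example Sentences

    The software works very well, except that whichever option I choose when saving, the result always ends up in the gap-fill (cloze) format. What do the words shown in red mean after saving? Are they new words? Can it be switched to other corpora? I would appreciate Dr. Xu's advice.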
     
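  15. A question for Dr. Xu
    Is the official release of BFSU Sentence Collector 1.0 (the corpus-based tool for extracting English example sentences) available yet?

    I urgently need it for senior high school English teaching.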