On cleaning the text of a special-purpose corpus

Posted by daqianqi123 on 2016-08-04 in the "Special-Purpose Corpora" forum

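  1. A question for the more experienced members: how should I handle formula symbols in a special-purpose corpus? When the files are saved in txt format, the formulas turn into garbled characters. Could I simply delete the formulas that appear in the text? Would that have much impact on corpus research, or is there a better way to handle them? Thanks!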
     
  2. Formulas are indeed a nuisance in corpus building. Since they lose their meaning in plain text, deleting them from the corpus text is fine. This does not hold, however, if the theses are to be retrieved in more advanced or multimodal ways. The answer to your question depends on the nature of your investigation and on how significant those formulas are for you.
     
  3. I would replace them with a placeholder symbol, for instance "FML". This preserves the integrity of the sentence structure, since formulas are often part of a sentence.
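
The placeholder approach from the last reply could be sketched roughly as follows in Python. This is only a heuristic under stated assumptions, not a method anyone in the thread described: it assumes the text is whitespace-tokenized (so it does not suit unsegmented Chinese text as-is), and it treats any token containing a Unicode math symbol, a private-use character, or the replacement character U+FFFD as formula residue. The names `looks_like_formula` and `replace_formulas` are made up for illustration.

```python
import unicodedata

PLACEHOLDER = "FML"          # placeholder suggested in the thread
TRAILING = ".,;:!?"          # sentence punctuation kept after a formula


def looks_like_formula(token: str) -> bool:
    """Heuristic (an assumption): flag tokens that contain a math symbol
    (Unicode category Sm, e.g. = + < ∑), a private-use character (Co),
    or the replacement character U+FFFD left by a failed decoding."""
    for ch in token:
        if unicodedata.category(ch) in ("Sm", "Co") or ch == "\ufffd":
            return True
    return False


def replace_formulas(text: str) -> str:
    """Replace formula-like tokens with PLACEHOLDER, merging directly
    adjacent formula tokens into a single placeholder."""
    out = []
    for token in text.split():
        core = token.rstrip(TRAILING)      # formula part of the token
        tail = token[len(core):]           # trailing punctuation, if any
        if core and looks_like_formula(core):
            if out and out[-1] == PLACEHOLDER:
                out[-1] = PLACEHOLDER + tail   # merge with previous formula
            else:
                out.append(PLACEHOLDER + tail)
        else:
            out.append(token)
    return " ".join(out)


print(replace_formulas("The loss x=y+1 is minimized when ∑w≤1."))
```

Because the check works token by token, a formula written with spaces around its operators (such as "a = b + c") would only be partly replaced, and all line breaks are collapsed into single spaces. For a real corpus the rule would need tuning against samples of the actual garbled output.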