On text cleaning for specialized corpora

Posted by daqianqi123 on 2016-08-04 in the "专门用途语料库" (Specialized Corpora) forum

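  1. A question for the more experienced members: how should I handle formulas and symbols in a specialized corpus? When I save the files in txt format, the formulas turn into garbled characters. Could I simply delete the formulas that appear in the text? How much would that affect research on the corpus? Or is there a better way to handle them? Thanks!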
     
  2. Formulas are annoying for corpus building indeed. Since they lose their meaning in pure text, deleting them from the corpus text will be fine. However, this does not hold if the theses are to be retrieved in more advanced or multimodal ways. The answer to your question depends on the nature of your investigation, and the extent to which those formulas are significant for you.
     
  3. I would replace them with a symbol, for instance "FML". This will preserve the integrity of the sentence structure, as formulas are often part of a sentence.
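
The placeholder idea from the last reply could be sketched roughly as below. This is only an illustration under assumptions that are not from the thread: it assumes formulas show up either as LaTeX-style `$...$` spans or as runs of replacement and private-use characters left by a lossy txt export. The function name `replace_formulas` and the pattern are hypothetical and would need adjusting to how formulas actually appear in your files.

```python
import re

PLACEHOLDER = "FML"  # the token suggested in the thread

# Assumption: a formula is either an inline $...$ span or a run of
# U+FFFD / Private Use Area characters produced by a broken export.
FORMULA_PATTERN = re.compile(
    r"\$[^$\n]+\$"               # LaTeX-style inline math
    r"|[\ufffd\ue000-\uf8ff]+"   # replacement and private-use debris
)

def replace_formulas(text: str, placeholder: str = PLACEHOLDER) -> str:
    """Swap each detected formula span for one placeholder token."""
    # Pad with spaces so the token never fuses with a neighbouring word.
    replaced = FORMULA_PATTERN.sub(f" {placeholder} ", text)
    # Collapse the extra spaces introduced by the padding.
    replaced = re.sub(r"[ \t]+", " ", replaced)
    # Remove the space left before sentence punctuation.
    replaced = re.sub(r" ([.,;:!?])", r"\1", replaced)
    return replaced.strip()

sentence = "The energy is given by $E = mc^2$ in this model."
print(replace_formulas(sentence))
# prints: The energy is given by FML in this model.
```

Because every formula becomes the same token, the sentence still parses and the token can be filtered out or counted separately in word lists later, depending on what the investigation needs.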