Text cleanup for a self-built small corpus

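Hello everyone, I'd like to ask about building a small corpus of my own. My texts come from engineering papers, and the body text contains many Greek letters and some formula expressions. How should these be handled? If I delete them, the sentences are no longer complete. Would that affect POS tagging and the syntactic parsing that follows?
For example, a sentence like this: Letting αi(·)=log(|Hi(·)), it can be shown that a generalization of (4) with a real two-tone stimulus is expressed as (5) where p=m+n|, and k1 and k2 represent the frequencies of the two different tones, with k1,k2>0.
If I want to analyze the passive voice, how should I handle this?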
 
Re: Text cleanup for a self-built small corpus

It is a tough question for any corpus linguist. For me, I've replaced all these formulas and equations with <>.
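
For anyone who wants to automate that replacement, here is a rough sketch in Python. The reply above doesn't say how the substitution was actually done, so everything here is an assumption: the function name `mask_formulas`, the rule that a whitespace-delimited token counts as "formula-like" if it contains a Greek letter or one of `= < > | ·`, and the choice to keep trailing punctuation. It will miss plain variable names such as k1 and equation references such as (4), so treat it as a starting point and check the output by hand.

```python
import re

PLACEHOLDER = "<>"

# Assumed heuristic (not from the thread): a token is formula-like if it
# contains a Greek letter (U+0370 to U+03FF) or one of = < > | ·
FORMULA_CHARS = re.compile(r"[\u0370-\u03FF=<>|·]")
TRAILING_PUNCT = ".,;:"


def mask_formulas(sentence: str) -> str:
    """Replace formula-like tokens with a placeholder, keeping punctuation."""
    masked = []
    for token in sentence.split():
        # Separate sentence punctuation so it survives the replacement.
        core = token.rstrip(TRAILING_PUNCT)
        tail = token[len(core):]
        if core and FORMULA_CHARS.search(core):
            masked.append(PLACEHOLDER + tail)
        else:
            masked.append(token)
    return " ".join(masked)


if __name__ == "__main__":
    print(mask_formulas("Letting αi(·)=log(|Hi(·)), it can be shown"))
```

Because each formula collapses into a single token and the surrounding words and punctuation stay in place, the sentence keeps its shape for tagging, e.g. "it can be shown" and "is expressed as" are untouched. How a given tagger labels the <> token itself is something you would need to check for your own tool.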
 