
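The paper "Collocations in non-interpreted and simultaneously interpreted English: a corpus study" mentions the 13 POS patterns that Bernardini proposed as common in English. The author used a corpus tool to automatically extract every word sequence matching these 13 patterns from the corpus, but the paper does not explain in detail how the extraction was done. Today I tried to do the same with the CEPIC interpreting corpus from Hong Kong Baptist University, but I could not find a way to make it work. Are there any books that cover this extraction method? Any pointers would be appreciated!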
“Following Bernardini (2015:529), I extracted the same 13 POS patterns which she hypothesised to be common in English. All the word forms which fit the patterns above were automatically extracted from the EnOr and EnSI components of the SIREN corpus. Since the interpreted component is 17% bigger than the original, the search of EnSI yielded more hits than that of EnOr. To compensate for the disparity, the longer list of wordforms was randomly trimmed to the length of the shorter one. After that duplicates were removed from both lists, resulting in two lists of POS chain types characteristic, respectively, of interpreted and comparable non-interpreted English.”
The passage above is how the paper "Collocations in non-interpreted and simultaneously interpreted English: a corpus study" describes the method.
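For anyone trying to reproduce this, here is a minimal Python sketch of the three steps in the quoted passage (extract pattern matches, randomly trim the longer list, remove duplicates). It is only my reading of the description, not the author's actual tool or procedure. It assumes the texts are already POS-tagged as `(word, tag)` pairs with Penn Treebank-style tags, for example from a tagger such as NLTK's `nltk.pos_tag`. The entries in `PATTERNS` are hypothetical placeholders, not Bernardini's 13 patterns, and the function names, the prefix matching of tags, and the `seed` parameter are likewise my own assumptions.

```python
import random

# Hypothetical placeholder patterns, NOT Bernardini's actual 13.
# Each element is a tag prefix, so "NN" also matches NNS, NNP, etc.
PATTERNS = [("JJ", "NN"), ("NN", "NN"), ("VB", "RB")]


def extract_matches(tagged, patterns):
    """Return every word sequence (token-level hit) whose tags fit a pattern."""
    hits = []
    for pattern in patterns:
        n = len(pattern)
        # Slide a window of the pattern's length over the tagged text.
        for i in range(len(tagged) - n + 1):
            window = tagged[i:i + n]
            if all(tag.startswith(p) for (_, tag), p in zip(window, pattern)):
                hits.append(" ".join(word.lower() for word, _ in window))
    return hits


def comparable_type_lists(hits_a, hits_b, seed=0):
    """Randomly trim the longer hit list to the shorter one's length,
    then remove duplicates from both lists."""
    rng = random.Random(seed)  # fixed seed so the trimming is repeatable
    size = min(len(hits_a), len(hits_b))
    a = rng.sample(hits_a, size) if len(hits_a) > size else list(hits_a)
    b = rng.sample(hits_b, size) if len(hits_b) > size else list(hits_b)
    return sorted(set(a)), sorted(set(b))


# Toy tagged input standing in for the original and interpreted components.
original = [("the", "DT"), ("strong", "JJ"), ("support", "NN"),
            ("came", "VBD"), ("quickly", "RB")]
interpreted = [("a", "DT"), ("strong", "JJ"), ("support", "NN"),
               ("and", "CC"), ("strong", "JJ"), ("support", "NN"),
               ("group", "NN")]

hits_or = extract_matches(original, PATTERNS)
hits_si = extract_matches(interpreted, PATTERNS)
types_or, types_si = comparable_type_lists(hits_or, hits_si)
print(types_or)
print(types_si)
```

Note that trimming happens on the token-level hit lists before deduplication, which follows the order given in the quote. Whether the author matched exact tags or tag prefixes is not stated, so that part would need checking against the tagset of whatever corpus you use.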