A question for corpus linguistics experts

brezeeboy · 2006-06-25

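1. Is there currently a tool for clustering English vocabulary? (I am researching intelligent electronic dictionaries and would like to automatically split a word list into multiple "word cards".)
2. Is there a tool that computes, for a corpus, what share of a word's total occurrences express a particular concept (for example, occurrences of bank meaning "financial institution" as a proportion of all occurrences of bank)?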

laohong · 2006-06-25

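Question 1: Computational linguistics has produced a lot of research on word clustering, and there are many algorithms you could try. Search on "Gugou" (股沟) and you will find plenty of literature to read.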

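Question 2: To find out what proportion of all occurrences of head in a corpus belong to one part of speech, say the verb head, you first have to POS-tag the corpus. By the same token, to extract at the level of concepts, the corpus first has to be annotated at that level. Of course, the computational cost of doing so is very high. It is therefore less trouble to extract on the basis of features such as collocation, semantic networks and surrounding context: if the context lets you tell river bank from money bank in a text, you can pull out the instances of bank that mean a financial institution. Naturally, how this is done will vary from word to word.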
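The context-based idea above can be illustrated with a rough sketch. It uses only the Python standard library, and the cue-word lists, the window size and the function names are assumptions made up for this example, not an established tool; a real cue list would have to come from collocation data for the corpus in question.

```python
import re

# Hypothetical cue words for each sense of "bank" (illustrative only).
MONEY_CUES = {"money", "loan", "account", "deposit", "cash", "credit"}
RIVER_CUES = {"river", "shore", "water", "stream", "muddy"}


def bank_sense_counts(text, window=4):
    """Count occurrences of 'bank' by guessed sense, using nearby cue words.

    Context is limited to the same sentence and to `window` tokens
    on each side of the target word.
    """
    counts = {"money": 0, "river": 0, "unknown": 0}
    # Naive sentence split on ., ! and ? so cues do not leak across sentences.
    for sentence in re.split(r"[.!?]+", text):
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok != "bank":
                continue
            context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            money = sum(w in MONEY_CUES for w in context)
            river = sum(w in RIVER_CUES for w in context)
            if money > river:
                counts["money"] += 1
            elif river > money:
                counts["river"] += 1
            else:
                counts["unknown"] += 1
    return counts


def money_share(text, window=4):
    """Share of all 'bank' tokens guessed to carry the financial sense."""
    counts = bank_sense_counts(text, window)
    total = sum(counts.values())
    return counts["money"] / total if total else 0.0


sample = (
    "She opened an account at the bank and asked about a loan. "
    "We had a picnic on the bank of the river. "
    "The bank was closed on Sunday."
)
print(bank_sense_counts(sample))
print(money_share(sample))
```

Instances with no cue word nearby are left as "unknown" rather than forced into a sense, which gives a sense of how much of the corpus such a simple heuristic fails to cover.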

xujiajin · 2006-06-25

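The second question is the harder one. The only resource I know of is
West, Michael. (1953). A General Service List of English Words.
It separates entries by sense and lists the frequency and percentage of each.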

xiaoz · 2006-06-25

Semantic taggers like Wmatrix might be of help in this connection.

xujiajin · 2006-06-25

I guess xiaoz is actually referring to USAS not Wmatrix.

USAS only differentiates the semantic categories of linguistic items, predominantly lexical items, but not necessarily different semantic options of a single lemma.

xiaoz · 2006-06-25

Yes. Wmatrix is a web interface for CLAWS and USAS.
Unfortunately, USAS does not make a distinction between the two meanings of bank in the following test example "There is a bank on the right bank of the river". Both instances of bank are tagged as I1 (Money generally).

brezeeboy · 2006-06-26

Re: A question for corpus linguistics experts

Thanks! But I have no idea where "Gugou" (股沟) is?!

刘语料 · 2006-06-26

"Gugou" (股沟) is the search engine Google.

laohong · 2006-06-26

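I hear the official Chinese name is "Guge" (谷歌). Personally I think "Gugou" (股沟) would be better, for no other reason than that it pops up by itself when you type in pinyin.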

zephyr · 2006-06-27

Google would be better off called "Gougou" (狗狗, "doggy").
Is semantic tagging the direction in which corpus annotation is heading?
