Urgent help needed! Part-of-speech statistics for the State Language Commission corpus (国家语委语料库)

hiwendy · 2009-07-12

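Hello, everyone!

In the 《国家语委语料库科研成果简介》 (a summary of research results from the State Language Commission corpus) I found charts of the main word classes, but unfortunately they give neither the actual figures nor each class's share of the corpus. So I took the brute-force route. According to the documentation, /* retrieves a particular part of speech, so I searched for /n, /nr, /ns and every other noun tag, added up the frequencies to get the total noun count, and divided by 20 million to get the share. The results, however, do not look realistic: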

Part of speech    Frequency    Share (%)

Noun                2022692        10.11
Verb                1649610         8.25
Adjective            545331         2.73
Numeral              339184         1.70
Measure word         150934         0.75
Adverb               558142         2.79
Pronoun              487429         2.44
Subtotal            5753322        28.77

Preposition          381194         1.91
Conjunction          315580         1.58
Particle             946883         4.73
Subtotal            1643657         8.22

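In other words, the main word classes together add up to only a little over one third of the whole corpus, which is hard to believe!

There must be something wrong with my search method. Could someone please set me straight?

Also, how should I write the expression if I want to query prefixes other than “非”, “副” and “准”?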

Millions of thanks!!!

hiwendy · 2009-07-12

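Another beginner's question:

I have built a Chinese corpus of my own, segmented and POS-tagged with ICTCLAS and already converted to Unicode. Can I search for particular parts of speech in WordSmith with the expressions below?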

==*/n*== nouns
==*/v*== verbs
==*/a*== adjectives
==*/m*== numerals
==*/q*== measure words
==*/d== adverbs
==*/r*== pronouns
==*/p*== prepositions
==*/c*== conjunctions
==*/u*== particles

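If not, how else can I get statistics for each part of speech?

I also tried searching for "/" to get the total word count, but got only a little over 1,000 hits, although the corpus actually contains 500,000 characters. MyZiCiFreq, the State Language Commission's character and word frequency tool, reports a little over 330,000 words, which seems about right. However, that tool does its own segmentation, and I worry that its standard differs from that of ICTCLAS, so mixing the two data sources would make the figures less reliable. Is there any way around this?

I need the data urgently. Thank you all for any pointers!!!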
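One way to avoid mixing two segmenters is to take both the per-class counts and the total word count from the same ICTCLAS output. The following is only a minimal sketch, assuming the tagged text consists of whitespace-separated word/tag tokens and that the first letter of each tag identifies the major class (n for nouns, v for verbs, and so on); the function name `pos_counts` and the sample sentence are made up for illustration.

```python
from collections import Counter


def pos_counts(tagged_text):
    """Count tokens by the first letter of their tag in word/tag text."""
    counts = Counter()
    for token in tagged_text.split():
        # Split at the LAST slash so that a word containing "/" stays intact.
        word, sep, tag = token.rpartition("/")
        if sep and word and tag:
            counts[tag[0].lower()] += 1
    return counts


# Hypothetical sample in ICTCLAS-like word/tag format.
sample = "我/r 喜欢/v 北京/ns 的/u 秋天/n 。/w"
counts = pos_counts(sample)
total = sum(counts.values())  # total tagged tokens, punctuation included

for tag, n in counts.most_common():
    print(f"{tag}\t{n}\t{100 * n / total:.2f}")
```

Because `total` comes from the same tagged tokens as the per-class counts, the shares are internally consistent. Whether punctuation (tag w here) belongs in the denominator is a separate decision.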

xiaoz · 2009-07-12

You can't search for */n* to find nouns in WordSmith, because the slash / means "or" there. You will need to convert the annotation style from word/tag to word_tag and then search for *_n* for nouns.
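The conversion from word/tag to word_tag can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the script mentioned elsewhere in this thread: it assumes tags consist of ASCII letters and digits, start with a letter, and are followed by whitespace or the end of the text. The names `TAG_SLASH` and `slash_to_underscore` are hypothetical.

```python
import re

# A slash counts as a tag separator only if it is followed by a tag
# (ASCII letter, then letters/digits) that ends at whitespace or end of text.
TAG_SLASH = re.compile(r"/(?=[A-Za-z][A-Za-z0-9]*(?:\s|$))")


def slash_to_underscore(text):
    """Rewrite word/tag tokens as word_tag, leaving other slashes alone."""
    return TAG_SLASH.sub("_", text)


# Hypothetical sample in word/tag format.
sample = "我/r 喜欢/v 北京/ns 的/u 秋天/n 。/w"
print(slash_to_underscore(sample))
```

A slash inside a word, as in 1/2/m, is left untouched because only the last slash is followed by a tag. For real files, read and write with the encoding the corpus is actually stored in.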

hiwendy · 2009-07-12

Thank you very much, Prof. Xiao!

But how can I do that? Prof. Liang said PowerGREP can be too powerful at times, so I dare not use it. Is there any other software that is fast and efficient? By the way, I have prepared a little tool that POS-tags all my materials and converts them to Unicode at the same time. All the POS-tagged corpus files are now in Unicode, and TextFile from http://www.clqsoft.com does not work!

Thank you again!

xusun575 · 2009-07-12

Originally posted by hiwendy:
Thank you very much, Prof. Xiao!

But how can I do that? Prof. Liang said PowerGREP can be too powerful at times, so I dare not use it. Is there any other software that is fast and efficient? By the way, I have prepared a little tool that POS-tags all my materials and converts them to Unicode at the same time. All the POS-tagged corpus files are now in Unicode, and TextFile from http://www.clqsoft.com does not work!

Thank you again!

Please try what is indicated below, with the help of MS Word.

xiaoz · 2009-07-12

I think I uploaded some Perl scripts to this site some time ago that convert between various annotation styles, including the slash style, the underscore style, the BNC style, and the XML format. You can search the site for the scripts.

superyangt · 2009-07-12

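I don't think your method is clumsy at all, and it should be correct. Perhaps you should look again at the subcategories under each part of speech and make sure every one has been counted, especially for nouns and verbs.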

hiwendy · 2009-07-13

Originally posted by xiaoz:
I think I uploaded some Perl scripts to this site some time ago that convert between various annotation styles, including the slash style, the underscore style, the BNC style, and the XML format. You can search the site for the scripts.

Thanks a lot! I hope it will work!

hiwendy · 2009-07-13

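Originally posted by superyangt:
I don't think your method is clumsy at all, and it should be correct. Perhaps you should look again at the subcategories under each part of speech and make sure every one has been counted, especially for nouns and verbs.

Thank you! But I followed the corpus documentation, and all the subcategories should have been counted. I am simply baffled by the result.

Almost every sentence seems to contain nouns and verbs, so how can there be so few? Together they come to less than 20%!

In the online automatic statistics for the 传媒语言语料库 (Media Language Corpus), nouns and verbs alone make up about half!
Nouns: 186790 28.65
Verbs: 138509 21.25

In my own corpus, too, nouns and verbs account for more than one third.

Then again, if the result is correct, I have made a new discovery!

But how could it be reasonably explained?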

hiwendy · 2009-07-18

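Found the cause!

The 20 million in their description is the number of characters, not the number of words! Dividing the noun count by the total character count is of course nonsense.
Live and learn, haha.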
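To see how much the denominator matters, the shares can be recomputed against a word-based total. The thread does not give the corpus's actual word count, so the sketch below uses the sum of the ten listed classes as a stand-in denominator. That is an assumption for illustration only, and it overstates every share because unlisted classes and punctuation are left out.

```python
# Frequencies as posted at the top of this thread.
counts = {
    "noun": 2022692,
    "verb": 1649610,
    "adjective": 545331,
    "numeral": 339184,
    "measure word": 150934,
    "adverb": 558142,
    "pronoun": 487429,
    "preposition": 381194,
    "conjunction": 315580,
    "particle": 946883,
}

# Stand-in denominator: tokens in the ten listed classes only (an assumption,
# not the real corpus word count).
listed_total = sum(counts.values())

# Share of each class among the listed classes, in percent.
shares = {pos: 100 * n / listed_total for pos, n in counts.items()}

for pos, share in shares.items():
    print(f"{pos}\t{counts[pos]}\t{share:.2f}")
```

On this basis nouns come out at roughly 27% and verbs at roughly 22% of the listed classes, which is much closer to the media corpus figures quoted earlier than the 10.11% and 8.25% obtained by dividing by 20 million characters.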

xujiajin · 2010-11-24

http://www.cncorpus.org/

Introduction to the State Language Commission corpus (国家语委语料库)
