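These word frequencies are based on a corpus of film and television subtitles containing about 47 million Chinese characters (about 33 million words), segmented with the ICTCLAS Chinese word segmenter.
We compared them with the frequencies in several existing word frequency lists, using experimental data from Chinese word naming and lexical decision tasks. The new frequencies explained reaction times (RT) best.
The full versions of the three frequency lists are available here as a free download, for non-profit academic research and exchange: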
http://expsy.ugent.be/subtlex-ch/
For the methods and details, please consult and cite:
Cai, Q. & Brysbaert, M. (in press). SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles. PLoS ONE.
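The paper is not online yet, so I have attached the final version of the manuscript. The abstract is pasted below.
If you already received an earlier version of the frequencies and the paper by private email, please update to this version.
The frequency lists are plain text files. An easy way to use them: download and save the file, start Excel, and open the file from within Excel. For "Original data type" choose Delimited (the upper option), and for "File origin" choose Simplified Chinese. To load the files from another program, see the figures in the paper, which describe the format.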
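For those who would rather load the lists from a script than from Excel, here is a minimal Python sketch. It rests on assumptions that are not stated in this post: that the columns are separated by tabs, and that the file uses a Simplified Chinese encoding that `gb18030` can decode (it is a superset of GBK and GB2312). The function name and the file name in the usage comment are hypothetical. Please check the figures in the paper for the actual layout and adjust the arguments if needed.

```python
import csv


def read_frequency_table(path, encoding="gb18030", delimiter="\t"):
    """Read a delimited text file into a list of rows (lists of strings).

    Assumptions (not confirmed by the post): tab-separated columns and a
    Simplified Chinese encoding decodable as gb18030.
    """
    with open(path, encoding=encoding, newline="") as handle:
        # QUOTE_NONE: treat quotation marks as ordinary characters, since
        # they may appear as entries in a word or character list.
        reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        # Drop completely empty lines; keep everything else untouched.
        return [row for row in reader if row]


# Usage (the file name is a placeholder for whichever list you downloaded):
# rows = read_frequency_table("word_frequencies.txt")
# for row in rows[:5]:
#     print(row)
```

The rows are returned as plain strings, so any leading information lines or header row stay visible and you can decide yourself which columns to convert to numbers.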
Everything else should be covered in the paper. If you have questions, email me: miao.cai@gmail.com
I hope this is useful to you all. Best wishes!
Abstract
Word frequency is the most important variable in language research. However, despite the growing interest in the Chinese language, there are only a few sources of word frequency measures available to researchers, and the quality is less than what researchers in other languages are used to.
Following recent work by New, Brysbaert, and colleagues in English, French and Dutch, we assembled a database of word and character frequencies based on a corpus of film and television subtitles (46.8 million characters, 33.5 million words). In line with what has been found in the other languages, the new word and character frequencies explain significantly more of the variance in Chinese word naming and lexical decision performance than measures based on written texts.
Our results confirm that word frequencies based on subtitles are a good estimate of daily language exposure and capture much of the variance in word processing efficiency. In addition, our database is the first to include information about the contextual diversity of the words and to provide good frequency estimates for multi-character words and the different syntactic roles in which the words are used. The word frequencies are freely available for research purposes.
Attachment