[Download] Frequency lists for top 5000 Chinese words and top 2000 characters

xiaoz

Permanent Super Administrator
Staff member
#1
Here are the frequency lists of the top 5000 Chinese words and the top 2000 Chinese characters covered in the just-published frequency dictionary of Mandarin Chinese (http://www.routledge.com/books/A-Frequency-Dictionary-of-Mandarin-Chinese-isbn9780415455862). These lists are based on a balanced corpus of ca. 50 million words (or ca. 73 million Chinese characters).

In addition to the normalised frequency (per million words for the word list, per million characters for the character list), I have included here the usage rate and dispersion rate for both the word list and the character list. (The published dictionary does not include such statistics for the character list.) For a discussion of these concepts and the rationale behind them, of the relationship between the lists and the HSK lexical syllabus, or for a presentation of the corpus data, please refer to the Introduction chapter of the book.
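For readers unfamiliar with the term, a normalised frequency is conventionally a raw count rescaled to a common base, here one million tokens. The sketch below only illustrates that general definition and is not a description of how the dictionary's figures were actually produced; the function names and the example count are hypothetical, and the corpus size is simply the approximate 50 million words quoted above.

```python
def per_million(raw_count, corpus_size):
    """Rescale a raw count to occurrences per million tokens."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * 1_000_000 / corpus_size


def relative_frequency(freq_per_million):
    """Convert a per-million figure back to a proportion between 0 and 1."""
    return freq_per_million / 1_000_000


# Hypothetical count: a word seen 25,000 times in a 50-million-word corpus.
example_pm = per_million(25_000, 50_000_000)
print(example_pm)                      # 500.0
print(relative_frequency(example_pm))  # 0.0005
```

The second function is the same ratio read the other way round: a per-million figure divided by one million is the share of all tokens accounted for by that item.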

These lists can be referenced in your own research by citing the above Routledge frequency dictionary.
 

Attachments

armstrong

Senior member
#2
Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

Thanks a lot, Dr. Xiao. These are very rigorous word and character lists.
Personally, I think they will contribute to the teaching and learning of Chinese.
 
#3
Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

Great resource, bookmarked. Thank you, Dr. Xiao!
 
#5
Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

Congratulations to Dr. Xiao on the publication of the new book.
Word list saved!
 
#6
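Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

Thanks for sharing! I have two questions:
1) The file seems to contain only the 5000-word list and not the 2000-character list. Was it left out?
2) The "Frequency per M words" figures seem inaccurate: adding up the first 1500 entries already gives 2.89M.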
 

xiaoz

Permanent Super Administrator
Staff member
#7
Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

1) There are two worksheets in the Excel workbook - just click on the other tab to view the character list;

2) It appears you have no idea of how normalised frequency is computed. Search this website for the related statistical background.

 
#10
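Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

Hello Dr. Xiao, and thank you for your latest research contribution on Chinese character and word frequencies. I would like to do empirical research on Chinese language cognition, but I have struggled to find a reasonably recent Chinese word-frequency corpus. What can be consulted in China are a few Chinese word-frequency dictionaries published before 1990; the frequencies of many of their entries have since changed dramatically, and many words that are now in common use are not included at all. May I ask whether your study, besides these 5000 relatively high-frequency words, also has counts for other, lower-frequency Chinese words (a word-frequency resource with broader coverage)? Some of the words we use in our experiments are not very frequent, and it is hard to find their exact frequencies. Thank you!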
 

xiaoz

Permanent Super Administrator
Staff member
#11
Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

Top 50,000 Chinese word frequency list, with individual and accumulated proportions. This list is based on raw frequencies, which are not adjusted with the usage rate to reflect dispersion across registers.
http://www.lancs.ac.uk/fass/projects/corpus/data/top50000_Chinese_words.zip

Full frequency list of Chinese characters:
http://www.lancs.ac.uk/fass/projects/corpus/data/Chinese_character_frquency_list.zip
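
As a rough illustration of what "individual and accumulated proportions" usually means for a list like this, the sketch below ranks items by raw frequency, divides each count by the corpus size, and keeps a running total. It is an assumption about the general technique, not the procedure used to build the files above; the function name and the toy counts are invented for the example.

```python
from itertools import accumulate


def coverage_table(raw_counts, corpus_size):
    """Return (word, count, proportion, cumulative proportion) rows,
    ordered from most to least frequent."""
    ranked = sorted(raw_counts.items(), key=lambda item: item[1], reverse=True)
    # Accumulate the integer counts first, then divide, to limit rounding drift.
    running_totals = accumulate(count for _, count in ranked)
    return [
        (word, count, count / corpus_size, total / corpus_size)
        for (word, count), total in zip(ranked, running_totals)
    ]


# Toy counts for a hypothetical 100-token corpus, for illustration only.
toy_counts = {"的": 50, "是": 30, "在": 15}
for row in coverage_table(toy_counts, corpus_size=100):
    print(row)
```

The last column answers the question "what share of running text do the top N items cover?", which is why it can never exceed 1 when the counts and the corpus size come from the same data.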
 
#12
Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

"based on a balanced corpus of ca. 50 million words": could you tell us a bit more about this source corpus? Thank you.
 
#15
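A question for the professor: what is the total number of words in that corpus?

Thank you for the material and the guidance. Each of the 5000 Chinese words comes with a frequency, but I would also like to know the total number of words in the corpus used for the counts. I have not been able to work it out; is there a method I have missed? Thanks in advance for your guidance!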
 

xiaoz

Permanent Super Administrator
Staff member
#16
Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

Here are some statistics of the lexicon size:

With separate POS: 108,303 types (e.g. hui4 as a noun and as a verb are counted as two types)

With combined POS: 84,883 types (e.g. hui4 as a noun and as a verb are combined and counted as one type)

The above figures exclude items that are "uninteresting" from the lexicographic perspective, e.g. punctuation marks, symbols, Arabic numerals (written in either full-width or half-width form), and non-Chinese character strings.
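
The difference between the two figures can be seen in a small sketch. It assumes the corpus is available as (word, POS tag) pairs; the function name, the tags "n" and "v", and the toy tokens are hypothetical, and the filtering of punctuation, numerals and non-Chinese strings described above is not reproduced here.

```python
def count_types(tagged_tokens):
    """Return (types with separate POS, types with combined POS)."""
    # Separate POS: the same word form with two different tags is two types.
    separate = {(word, pos) for word, pos in tagged_tokens}
    # Combined POS: the tag is ignored, so each word form counts once.
    combined = {word for word, _ in tagged_tokens}
    return len(separate), len(combined)


# Toy data: hui4 (会) appears as a noun and as a verb.
toy_tokens = [("会", "n"), ("会", "v"), ("会", "v"), ("词", "n")]
print(count_types(toy_tokens))  # (3, 2)
```

The first number can never be smaller than the second, which matches the relation between 108,303 and 84,883 above.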
 
#17
Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

Dear Xiao,
Thank you very much for your kind help. You have already given the rate of the selected words and the total number of words in the corpus for the 50000-word list, but not for the 5000-word list (I don't know how to compute it). I just want to know that... Sorry to trouble you...
hongtao
 

xiaoz

Permanent Super Administrator
Staff member
#19
Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

I don't know what you want to compute or what you mean by the "amount of words". The lexicon size I gave in my previous message is for the whole 50-million-word dataset. For the top 5000 word list, there are obviously 5000 different word types, and for the top 50000 word list, there are 50000 word types.

 
#20
Re: [Download] Frequency lists for top 5000 Chinese words and top 2000 characters

Dear Xiao,
I am really ignorant in this field. Perhaps I just want to know how you compute the so-called "word probability"...
hongtao
 