On using WordSmith to retrieve the total word count of LCMC

jiji

Regular Member

I used WordSmith to search the 15 files in the character folder of LCMC and got the table below:

[attached figure: 2006051219085739.gif]


May I ask Dr. Xiao: does 43025 refer to the total word count?
 
Click "Statistics" in the attached figure and find the number under "tokens": that is what is usually meant by the total word count.
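What WordSmith reports as tokens and types can be reproduced in outline on a whitespace-segmented corpus such as LCMC. A minimal sketch, not WordSmith's actual code:

```python
def count_tokens_types(text):
    """Count running words (tokens) and distinct words (types)
    in a whitespace-segmented text, as WordSmith's statistics do."""
    words = text.split()            # LCMC texts are pre-segmented by spaces
    return len(words), len(set(words))

tokens, types = count_tokens_types("他 就 是 他")
print(tokens, types)  # 4 tokens, 3 types
```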
 
The entries below are sorted alphabetically:

[attached figure: 2006051220241859.gif]


The first three or four hundred entries are all symbols like these. How can I make the list show Chinese characters only? Or is the only way to edit it by hand and then count manually?


Also, there are two A's among the first few entries; on closer inspection they are in different fonts. This does not seem to be a matter of case. How can this be explained, and how can the duplicate be removed?
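One plausible explanation (a guess, not confirmed in the thread) is that one entry is the halfwidth ASCII "A" and the other the fullwidth "Ａ" (U+FF21), which is common in Chinese text; they look alike but are distinct code points, so a wordlist counts them separately. NFKC normalization folds the fullwidth form into the ASCII one:

```python
import unicodedata

a_half, a_full = "A", "Ａ"          # U+0041 vs U+FF21
print(a_half == a_full)              # False: distinct code points
print(unicodedata.normalize("NFKC", a_full) == a_half)  # True after folding
```

Normalizing the corpus (or the exported wordlist) this way before counting would merge the two entries.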

[This post was edited by its author on 12 May 2006 at 20:39:30]
 
Wordlists made using WordSmith or Xaira include the words in the corpus header.

A frequency list I made from LCMC is downloadable here -

http://www.ling.lancs.ac.uk/corplang/zipfiles/LCMC_wordlist.zip

Number of types: 45435
Number of tokens: 999824

(excluding the corpus header)
 
It's kind of strange that the number of types shown in the figure above is 43,024 (including the corpus header) while the number Dr. Xiao provides is 45,435 (excluding the header). Why?
 
Would it be fair to say that when compiling wordlists and keywords with WordSmith 4, both corpora should ideally be segmented but not yet annotated? That way there would not be so many messy symbols. Alternatively, could these tags be put into a stop list?
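As an alternative to a stop list, the symbol entries could be filtered out of an exported wordlist by keeping only entries made up of Chinese characters. A rough sketch (the regex covers only the basic CJK Unified Ideographs block, and the sample entries are made up):

```python
import re

# Keep only wordlist entries consisting entirely of CJK ideographs.
cjk = re.compile(r"^[\u4e00-\u9fff]+$")

entries = ["</p>", "A", "就是", "§", "语料库"]
chinese_only = [w for w in entries if cjk.match(w)]
print(chinese_only)  # ['就是', '语料库']
```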

[This post was edited by its author on 13 May 2006 at 09:41:31]
 
Re: On the total word count of LCMC

WordSmith does not accept wordlists in .txt format. How can they be converted into .lst files?
 
Re: On the total word count of LCMC

You can only use WordSmith-generated wordlists to extract keywords.

Quoting jiji's post of 2006-5-13 10:26:35:


WordSmith does not accept wordlists in .txt format. How can they be converted into .lst files?
 
When you make a wordlist of LCMC, you can choose to cut the header section ("Only part of the file") and ignore tags <*>. In this way, the wordlist does not include the words in the header section. There are some words consisting of English letters. This is normal, as they are words in the corpus (e.g. in scientific notation in text category J).
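The same effect, dropping the header and then ignoring SGML-style tags before counting, can be approximated in a script. The `<header>…</header>` delimiter below is only illustrative; LCMC's actual markup may differ:

```python
import re

def clean(text):
    # Drop a header section first (illustrative delimiter)...
    text = re.sub(r"<header>.*?</header>", " ", text, flags=re.S)
    # ...then drop any remaining <...> tags, as WordSmith's
    # "ignore tags <*>" option does.
    return re.sub(r"<[^>]*>", " ", text)

raw = "<header>编 者</header><p>就 是</p>"
print(clean(raw).split())  # ['就', '是']
```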

Here is the LCMC wordlist I have made:
http://forum.corpus4u.org/upload/forum/2006051323453475.zip
 
Thanks, Dr. Xiao.

I encountered another problem when I attempted to use Xaira to query LCMC.
Using a Word query for "就", I got:
就 3477
就是 980
就要 121
就业 21
就算 20

However, when I tried a Phrase query, I got:
就是 1
就要 0
就业 0
就算 0

Can you see why?
 
For Chinese data, the Phrase query (i.e. Quick query) only searches for a "phrase" consisting of single characters separated by white spaces. Strange? But it is true. This feature is not very useful for Chinese (though it is for English). You should use the Word query in this case, or the Query Builder if you want to search for all of them in one go.
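The difference can be illustrated on segmented text: a word query matches whole segmented tokens, while the Chinese phrase query treats the search string as single characters separated by spaces, so the two-character token 就是 is found only where 就 and 是 occur as adjacent separate tokens. A sketch of the behavior, not Xaira's actual code:

```python
tokens = ["他", "就是", "要", "就", "是"]

def word_query(tokens, w):
    """Count occurrences of w as a whole segmented token."""
    return sum(t == w for t in tokens)

def phrase_query(tokens, phrase):
    """Chinese phrase query: split the phrase into single characters
    and count that exact character sequence as adjacent tokens."""
    chars = list(phrase)
    n = len(chars)
    return sum(tokens[i:i + n] == chars for i in range(len(tokens) - n + 1))

print(word_query(tokens, "就是"))    # 1: the segmented token 就是
print(phrase_query(tokens, "就是"))  # 1: only 就 followed by 是
```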
 
Putting one or two spaces between two Chinese characters in a Phrase query makes Xaira either crash or churn out "no solutions". Anyway, your words helped solve this puzzle. Thanks.
 