Searching the CLEC sub-corpus ST3 with WordSmith 4: is something wrong with the figures? Please advise.

corpora · 2008-07-08

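I searched ST3, a sub-corpus of CLEC, with WordSmith 4 and the figures puzzle me. The statistics report tokens (running words) in text as 259525 and tokens used for wordlist as 232542. What is the difference between the two?
Also, Gui Shichun's introduction to CLEC gives the size of ST3 as 209043 (which I take to mean running words, i.e. tokens). Which figure should I use in my paper? I am quite confused. Could someone explain? Many thanks.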

hittle2008 · 2008-07-08

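Re: Searching the CLEC sub-corpus ST3 with WordSmith 4: is something wrong with the figures? Please advise.

1. Running words in text is the total number of tokens in the corpus. Depending on how you define a token, it should be greater than or equal to tokens used for wordlist (which is in effect the summed frequency of all types in the word list). Once the tokens belonging to the annotation and the stoplist are removed, the two figures should generally be equal.
2. Which figure you adopt matters little. What matters is that every place where you report such data uses figures from the same source, and that you state the source and the counting method.
Also, my counts seem to differ slightly from yours; have a look at the attachment.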

xujiajin · 2008-07-08

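Re: Searching the CLEC sub-corpus ST3 with WordSmith 4: is something wrong with the figures? Please advise.

corpora wrote:
Also, Gui Shichun's introduction to CLEC gives the size of ST3 as 209043

That figure is most likely the actual word count after the error codes in square brackets have been removed.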
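
To illustrate the point about bracketed error codes, here is a minimal Python sketch that counts tokens with and without the tags. It is not how WordSmith or the CLEC compilers counted; the tag pattern (anything between square brackets), the tokenizer regex, and the sample sentence are all assumptions made for illustration.

```python
import re

# Assumption: every error code is wrapped in square brackets, e.g. [vp3,1-].
ERROR_TAG = re.compile(r"\[[^\]]*\]")
# Assumption: a token is a run of letters (with internal ' or -) or a run of digits.
WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+")


def count_tokens(text: str, strip_tags: bool = True) -> int:
    """Count tokens, optionally removing bracketed error codes first."""
    if strip_tags:
        text = ERROR_TAG.sub(" ", text)
    return len(WORD.findall(text))


# Made-up learner sentence with made-up tags, for illustration only.
sample = "He have [vp3,1-] many informations [np6,-] about 3 city [np3,-1] ."

with_tags = count_tokens(sample, strip_tags=False)
without_tags = count_tokens(sample, strip_tags=True)
print(with_tags, without_tags)
```

When the tags are left in, the letters and digits inside them are counted as extra tokens, so the first figure is inflated relative to the second. A gap of this kind between a raw count and a published corpus size is what one would expect if the published figure excludes the annotation.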

hittle2008 · 2008-07-08

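Re: Searching the CLEC sub-corpus ST3 with WordSmith 4: is something wrong with the figures? Please advise.

Dr. Xu has a point. Try adding the square brackets and all the error codes to the stoplist; I did not add them in my own count.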

xiaoz · 2008-07-08

Re: Searching the CLEC sub-corpus ST3 with WordSmith 4: is something wrong with the figures? Please advise.

All numerals are collapsed into the category # in the wordsmith wordlist, though they are counted as running words in the text.
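
A rough Python sketch of that idea, under stated assumptions: every numeral counts as a running word, but all numerals share the single word-list entry `#`. The tokenizer regex and the function name `build_wordlist` are hypothetical and only mimic the behaviour described above; they are not WordSmith's own algorithm.

```python
import re
from collections import Counter

# Assumption: tokens are runs of letters (optionally with an apostrophe part)
# or numerals such as 20, 3.5 or 1,000.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:[.,]\d+)*")


def build_wordlist(text: str):
    """Return (running word count, frequency list with numerals under '#')."""
    tokens = TOKEN.findall(text.lower())
    wordlist = Counter("#" if tok[0].isdigit() else tok for tok in tokens)
    return len(tokens), wordlist


running, wordlist = build_wordlist(
    "In 1998 the 3 students wrote 20 essays and the 2 teachers read 20"
)
print(running)          # every numeral is still one running word
print(wordlist["#"])    # all numerals share one entry
print(len(wordlist))    # number of distinct entries (types)
```

In this sketch the individual numbers never appear as separate entries, so the number of types drops while the running word count stays the same.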

corpora · 2008-07-09

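Re: Searching the CLEC sub-corpus ST3 with WordSmith 4: is something wrong with the figures? Please advise.

Thank you all for the help; I have learned a lot. But hittle2008, the strange thing is that however many times I try, the figures I get for tokens (running words) in text and tokens used for wordlist still differ greatly (whether or not a stoplist is set).
Also, when I put a word such as "the" in the stoplist, the only change in the result is that "the" no longer appears in the frequency list; the overall statistics are exactly the same as without a stoplist. Why is that?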

hittle2008 · 2008-07-11

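Re: Searching the CLEC sub-corpus ST3 with WordSmith 4: is something wrong with the figures? Please advise.

There is probably a problem with your settings.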

hazhihan · 2008-07-14

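Re: Re: Searching the CLEC sub-corpus ST3 with WordSmith 4: is something wrong with the figures? Please advise.

xiaoz wrote:
All numerals are collapsed into the category # in the wordsmith wordlist, though they are counted as running words in the text.

Thanks. A few further observations:
In WST4, tokens in text counts punctuation but treats a run of consecutive punctuation marks as one token, and likewise a run of digits; tokens used for word list does not count punctuation, and also treats a run of consecutive digits as one token.

Try counting 200..5,
tokens in text is 4 (200/../5/,)
tokens used for word list is 2 (200/5)

Corrections welcome.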
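
The counting rule described in this post can be reproduced with two regular expressions. This is only a sketch of the observed behaviour, not WordSmith's actual tokenizer, and the results in the real program may depend on its settings; the two function names are hypothetical.

```python
import re

# Assumption: runs of letters, runs of digits, and runs of punctuation
# each form one token for the "tokens in text" figure.
RUNS_WITH_PUNCT = re.compile(r"[A-Za-z]+|\d+|[^\w\s]+")
# Assumption: punctuation runs are dropped for the word-list figure.
RUNS_WORDS_ONLY = re.compile(r"[A-Za-z]+|\d+")


def tokens_in_text(text: str) -> list[str]:
    """Tokens counted under the 'tokens in text' rule described above."""
    return RUNS_WITH_PUNCT.findall(text)


def tokens_for_wordlist(text: str) -> list[str]:
    """Tokens counted under the 'tokens used for word list' rule described above."""
    return RUNS_WORDS_ONLY.findall(text)


example = "200..5,"
print(tokens_in_text(example))       # ['200', '..', '5', ',']
print(tokens_for_wordlist(example))  # ['200', '5']
```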

hittle2008 · 2008-07-15

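Re: Searching the CLEC sub-corpus ST3 with WordSmith 4: is something wrong with the figures? Please advise.

Strange. Why are both of my figures the same, namely 2?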

hittle2008 · 2008-07-15

Re: Searching the CLEC sub-corpus ST3 with WordSmith 4: is something wrong with the figures? Please advise.

xiaoz wrote:
All numerals are collapsed into the category # in the wordsmith wordlist, though they are counted as running words in the text.

This happens only when the box "numbers in wordlist" under the "Languages" settings is not checked. Once it is checked, each number in the text is counted as a specific token, and the number of "tokens (running words) in text" is in most cases equal to that of "tokens used for wordlist".
