Reply: Help needed: text-splitting tool
If a text is 1,000 words long, it is said to have 1,000 "tokens". But many of these words will be repeated, and there may be, say, only 400 different words in the text. "Types", therefore, are the different words.
The ratio between types and tokens in this example would be 40%. But this type/token ratio (TTR) varies very widely in accordance with the length of the text -- or corpus of texts -- being studied. A 1,000-word article might have a TTR of 40%; a shorter one might reach 70%; 4 million words will probably give a type/token ratio of about 2%, and so on. Such type/token information is rather meaningless in most cases, though it is supplied in a WordList statistics display. The conventional TTR is informative, of course, if you're dealing with a corpus comprising lots of equal-sized text segments (e.g. the LOB and Brown corpora). But in the real world, especially if your research focus is the text as opposed to the language, you will probably be dealing with texts of different lengths, and the conventional TTR will not help you much.
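To make the arithmetic concrete, here is a minimal Python sketch of the conventional TTR -- not WordSmith's own code; it assumes the text has already been reduced to a flat list of word tokens, and the crude whitespace tokenisation is purely for illustration:

    def ttr(tokens):
        """Conventional type/token ratio, as a percentage: distinct word forms / running words."""
        return 100.0 * len(set(tokens)) / len(tokens) if tokens else 0.0

    text = "the cat sat on the mat and the dog sat on the rug"
    tokens = text.lower().split()   # crude whitespace tokenisation, for illustration only
    print(round(ttr(tokens), 1))    # 8 types / 13 tokens -> 61.5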
WordList therefore uses a different strategy for computing this. The standardised type/token ratio (STTR) is computed every n words as WordList goes through each text file. By default, n = 1,000. In other words, the ratio is calculated for the first 1,000 running words, then calculated afresh for the next 1,000, and so on to the end of your text or corpus. A running average is computed, which means that you get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with fewer than 1,000 words (or whatever n is set to) will get a standardised type/token ratio of 0.)
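And a sketch of the standardised calculation described above, under the same assumptions: the token stream is cut into consecutive n-word chunks, the TTR of each complete chunk is computed, and those values are averaged; leftover words at the end (and any text shorter than n) are ignored, which matches the 0 result mentioned for short texts. WordSmith's actual tokenisation and rounding will of course differ.

    def sttr(tokens, n=1000):
        """Standardised type/token ratio: average TTR over consecutive n-token chunks."""
        ratios = []
        for i in range(0, len(tokens) - n + 1, n):        # step through full n-token chunks only
            chunk = tokens[i:i + n]
            ratios.append(100.0 * len(set(chunk)) / n)    # TTR of this chunk, in %
        return sum(ratios) / len(ratios) if ratios else 0.0   # 0.0 for texts shorter than n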
See also
http://forum.corpus4u.org/showthread.php?t=353&highlight=standardised+TTR