求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

yuliaoku · 2008-11-04

请高手指点一下：WS4和WS5的WordList给出的Statistics中的

Standard TTR std. dev.

是如何计算的？我用手算出来的和该程序自动算出来的不一样。

谢谢！

清风出袖 · 2008-11-04

回复: 求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

敢问楼主是怎么算的？谢谢了。

yuliaoku · 2008-11-04

回复: 求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

因为从WS的帮助文件中找不到，所以自己尝试了一下，不知道对不对，请指教。

假如一个语料库由五篇小说构成，五篇小说的STTR分别为62.1、43.5、48.3、57.6和55.7。然后套用韩宝成《外语教学科研中统计方法》中第22页上的标准差公式计算，所得结果与WS的结果相差非常大。我知道是不对的，所以求教于大家。

hittle2008 · 2008-11-05

回复: 求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

楼主你的结果是多少呢?直接用统计函数STDEV,根据你的数据算出来是7.46

yuliaoku · 2008-11-05

回复: 求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

谢谢你的帮助。

我手算的结果和你的是一样的，也是7.460。但是WS4在Standardized TTR std. dev. 一栏给出的结果大约是53.00（精确数值记不清楚了）。好像是这5个数值的平均值，而不是这5个数值的标准差。

用任何几个文本重复，结果都相差悬殊，是不是我的理解有误？这个数值到底是说明什么的？

清风出袖 · 2008-11-05

回复: 求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

from WS 4 help manual

If a text is 1,000 words long, it is said to have 1,000 "tokens". But a lot of these words will be repeated, and there may be only say 400 different words in the text. "Types", therefore, are the different words.

The ratio between types and tokens in this example would be 40%.

But this type/token ratio (TTR) varies very widely in accordance with the length of the text -- or corpus of texts -- which is being studied. A 1,000 word article might have a TTR of 40%; a shorter one might reach 70%; 4 million words will probably give a type/token ratio of about 2%, and so on. Such type/token information is rather meaningless in most cases, though it is supplied in a WordList statistics display. The conventional TTR is informative, of course, if you're dealing with a corpus comprising lots of equal-sized text segments (e.g. the LOB and Brown corpora). But in the real world, especially if your research focus is the text as opposed to the language, you will probably be dealing with texts of different lengths and the conventional TTR will not help you much.

Wordlist uses a different strategy for computing this, therefore. The standardised type/token ratio (STTR) is computed every n words as Wordlist goes through each text file. By default, n = 1,000. In other words the ratio is calculated for the first 1,000 running words, then calculated afresh for the next 1,000, and so on to the end of your text or corpus. A running average is computed, which means that you get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with less than 1,000 words (or whatever n is set to) will get a standardised type/token ratio of 0.)

Setting the N boundary

Adjust the n number in Minimum & Maximum Settings to any number between 100 and 20,000.

What STTR actually counts

Note: The ratio is computed a) counting every different form as a word (so say and says are two types) b) using only the words which are not in a stop-list c) those which are within the length you have specified, d) taking your preferences about numbers and hyphens into account.

The number shown is a percentage of new types for every n tokens. That way you can compare type/token ratios across texts of differing lengths. This method contrasts with that of Tuldava (1995:131-50) who relies on a notion of 3 stages of accumulation. The WordSmith method of computing STTR was my own invention but parallels one of the methods devised by the mathematician David Malvern working with Brian Richards (University of Reading).

Further discussion

TTR and STTR are both pretty crude measures even if they are often assumed to imply something about "lexical density". Suppose you had a text which spent 1,000 words discussing ELEPHANT, LION, TIGER etc, and then 1,000 discussing MADONNA, ELVIS, etc., then 1,000 discussing CLOUD, RAIN, SUNSHINE. If you set the STTR boundary at 1,000 and happened to get say 48% or so for each section, the statistic in itself would not tell you there was a change involving Africa, Music, Weather. Suppose the boundary between Africa & Music came at word 650 instead of at word 1,000, I guess there'd be little or no difference in the statistic. But what would make a difference? A text which discussed clouds and written by a person who distinguished a lot between types of cloud might also use MIST, FOG, CUMULUS, CUMULO-NIMBUS. This would be higher in STTR than one written by a child who kept referring to CLOUD but used adjectives like HIGH, LOW, HEAVY, DARK, THIN, VERY THIN to describe the clouds... and who repeated DARK, THIN, etc a lot in describing them.....

(NB. Shakespeare is well known to have used a rather limited vocabulary in terms of measures like these!)

yuliaoku · 2008-11-06

回复: 求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

Thanks a lot, 清风出袖!

I've read this part of the help manual. But it only gives information about STTR. Nothing is mentioned about the calculation of STTR std. dev. and its significance in corpus-based studies.

I shall be very pleased if you could provide me with some information about it.

作者清风出袖:
from WS 4 help manual

If a text is 1,000 words long, it is said to have 1,000 "tokens". But a lot of these words will be repeated, and there may be only say 400 different words in the text. "Types", therefore, are the different words.

The ratio between types and tokens in this example would be 40%.

But this type/token ratio (TTR) varies very widely in accordance with the length of the text -- or corpus of texts -- which is being studied. A 1,000 word article might have a TTR of 40%; a shorter one might reach 70%; 4 million words will probably give a type/token ratio of about 2%, and so on. Such type/token information is rather meaningless in most cases, though it is supplied in a WordList statistics display. The conventional TTR is informative, of course, if you're dealing with a corpus comprising lots of equal-sized text segments (e.g. the LOB and Brown corpora). But in the real world, especially if your research focus is the text as opposed to the language, you will probably be dealing with texts of different lengths and the conventional TTR will not help you much.

Wordlist uses a different strategy for computing this, therefore. The standardised type/token ratio (STTR) is computed every n words as Wordlist goes through each text file. By default, n = 1,000. In other words the ratio is calculated for the first 1,000 running words, then calculated afresh for the next 1,000, and so on to the end of your text or corpus. A running average is computed, which means that you get an average type/token ratio based on consecutive 1,000-word chunks of text. (Texts with less than 1,000 words (or whatever n is set to) will get a standardised type/token ratio of 0.)

Setting the N boundary

Adjust the n number in Minimum & Maximum Settings to any number between 100 and 20,000.

What STTR actually counts

Note: The ratio is computed a) counting every different form as a word (so say and says are two types) b) using only the words which are not in a stop-list c) those which are within the length you have specified, d) taking your preferences about numbers and hyphens into account.

The number shown is a percentage of new types for every n tokens. That way you can compare type/token ratios across texts of differing lengths. This method contrasts with that of Tuldava (1995:131-50) who relies on a notion of 3 stages of accumulation. The WordSmith method of computing STTR was my own invention but parallels one of the methods devised by the mathematician David Malvern working with Brian Richards (University of Reading).

Further discussion

TTR and STTR are both pretty crude measures even if they are often assumed to imply something about "lexical density". Suppose you had a text which spent 1,000 words discussing ELEPHANT, LION, TIGER etc, and then 1,000 discussing MADONNA, ELVIS, etc., then 1,000 discussing CLOUD, RAIN, SUNSHINE. If you set the STTR boundary at 1,000 and happened to get say 48% or so for each section, the statistic in itself would not tell you there was a change involving Africa, Music, Weather. Suppose the boundary between Africa & Music came at word 650 instead of at word 1,000, I guess there'd be little or no difference in the statistic. But what would make a difference? A text which discussed clouds and written by a person who distinguished a lot between types of cloud might also use MIST, FOG, CUMULUS, CUMULO-NIMBUS. This would be higher in STTR than one written by a child who kept referring to CLOUD but used adjectives like HIGH, LOW, HEAVY, DARK, THIN, VERY THIN to describe the clouds... and who repeated DARK, THIN, etc a lot in describing them.....

(NB. Shakespeare is well known to have used a rather limited vocabulary in terms of measures like these!)

清风出袖 · 2008-11-06

回复: 求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

如果知道了STTR直接用Standar deviation公式吧，而且如果是样本要记得用N—1 来除。

yuliaoku · 2008-11-06

回复: 求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

非常感谢您的及时回复！但是我还是不太明白，想请教一下：附件中是LOCNESS中3个文件（alevel4, alevel5, alevel6)的词表统计截图。能否请您看一下standardized TTR std. dev.的值54.81是如何计算出来的，这个值有何意义？

非常感谢！！！

作者清风出袖:
如果知道了STTR直接用Standar deviation公式吧，而且如果是样本要记得用N—1 来除。

yuliaoku · 2009-04-29

回复: 求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

这个问题我至今也没弄明白。

Help！

作者 yuliaoku:
非常感谢您的及时回复！但是我还是不太明白，想请教一下：附件中是LOCNESS中3个文件（alevel4, alevel5, alevel6)的词表统计截图。能否请您看一下standardized TTR std. dev.的值54.81是如何计算出来的，这个值有何意义？

非常感谢！！！

stream · 2009-12-15

回复: 求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

我统计200篇口语语料的TTR, 结果那一列就是空白，这种情况是将其TTR算为0呢，还是要调整setting 的最大最小值呢？谢谢各位！

求助：WS4和WS5中的Standard TTR std. dev. 是如何计算的？

yuliaoku

初级会员

清风出袖

高级会员

yuliaoku

初级会员

hittle2008

yuliaoku

初级会员

清风出袖

高级会员

yuliaoku

初级会员

清风出袖

高级会员

yuliaoku

初级会员

附件

yuliaoku

初级会员

stream

普通会员