

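I used SARA to calculate z-scores for some collocates in the BNC, then checked them with CalcZ, a program dedicated to calculating Z-scores, and the results are very far apart. What is going on? Any expert advice would be much appreciated!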

Can you provide us here with the raw data you used and perhaps also how you calculated the Z-score?

Without this kind of information, how can anyone help you?
Perhaps different tools define "word" differently. What matters most is the ranking of the words you get from the same tool, which lets you compare the collocability of a specific word with its collocates. You can consult Dr. Barlow's analysis.
to 动态语法: I use the BNC as the control corpus for my comparative study. Since it's monstrously large, I only selected 500 instances randomly out of around 17700 occurrences of the word under discussion. But since all of this is processed by SARA, the typicality of the randomly selected concordances is questionable. So far, I haven't found a way to copy the concordance lines from SARA to a Word file. I'm sorry, I can't provide the raw data. The software I use to calculate the Z-score requires 5 pieces of information, viz. C1, the number of co-occurrences of the node word and the collocate; C2, the frequency of the collocate; S, which defaults to 10; Cs, the total number of words in the corpus; and n, the frequency of the node word. It's a simple program, really, meant to save manual effort.
to 刘语料: I'm sorry, but I'm not quite clear what you are referring to. Could you explain more specifically? Thank you!
Which version of the BNC are you using, Version 2 or the World Edition? Could it be that the online SARA uses the old Version 2 while you are using the World Edition?

If you let me know which word you are studying in your example, perhaps I will be able to check it for you.

Quoting ibid (2005-10-8 23:39:25):
to 动态语法: I use the BNC as the control corpus for my comparative study. Since it's monstrously large, I only selected 500 instances randomly out of around 17700 occurrences of the word under discussion. But since all of this is processed by SARA, the typicality of the randomly selected concordances is questionable. So far, I haven't found a way to copy the concordance lines from SARA to a Word file. I'm sorry, I can't provide the raw data. The software I use to calculate the Z-score requires 5 pieces of information, viz. C1, the number of co-occurrences of the node word and the collocate; C2, the frequency of the collocate; S, which defaults to 10; Cs, the total number of words in the corpus; and n, the frequency of the node word. It's a simple program, really, meant to save manual effort.

No, I am not interested in your concordance lines; I meant the raw numbers.
Can you give these numbers: C1 (co-occurrences of node word and collocate), C2
(frequency of the collocate), S (default 10), Cs (total corpus size in words) and
n (frequency of the node word)? (By the way, what is "S, default 10"? The window
span? Do both programs have it as the default setting?)

Just like scientists doing experiments, you want others to have the exact same
raw data to be able to replicate your results. Without the raw data we don't know
what you are talking about and how to respond to you.
to xiaoz and 动态语法: I'm using the BNC World Edition released in 2000. Let's take 'everyone' as an example: the collocate 'else' co-occurs 1153 times with 'everyone'; 'everyone' and 'else' occur 12786 and 19931 times respectively; S is the window span, which we set to 10 (5 left, 5 right); and the total number of words in the BNC is 100,000,000. The Z-score given by BNC is 237.1, whereas when I put the same data into the Z-score software CalcZ, the result is 150.3. So that is my question.
Besides, if I randomly select 500 concordance lines from the BNC, 'else' co-occurs 36 times with 'everyone', and when I calculate the Z-score within the downloaded lines only, I get 37.3. Why is it such a far cry from 237.1, the figure for the whole corpus?
Another question: does anyone know how to save concordance lines in a Word file? At the moment I can only save the lines in XML format and then copy them into a Word file, but there are a lot of tags I need to get rid of.

Quoting ibid (2005-10-9 14:15:35):
to xiaoz and 动态语法: I'm using the BNC World Edition released in 2000. Let's take 'everyone' as an example: the collocate 'else' co-occurs 1153 times with 'everyone'; 'everyone' and 'else' occur 12786 and 19931 times respectively; S is the window span, which we set to 10 (5 left, 5 right); and the total number of words in the BNC is 100,000,000. The Z-score given by BNC is 237.1, whereas when I put the same data into the Z-score software CalcZ, the result is 150.3. So that is my question.

You need to look at the documentation about CalcZ (what is it, by the way?) and
see what formula the author uses. I got a different score (with ACWT), too, based
on the formula given by BNCWeb.


The corpus size may be the problem: does the program use
100,000,000 or something else as the corpus size? This may cause different results.
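
For reference, here is a minimal sketch of one common z-score definition (the Berry-Rogghe style formula), fed with the five inputs named in this thread. It is only an assumption that SARA, BNCweb or CalcZ use this formula; the function name z_score and the second corpus size are illustrative, and the output need not match any figure quoted in this thread. What it does show is how much the score moves when the corpus size or the span changes.

```python
import math

def z_score(c1, c2, n, span, corpus_size):
    """Berry-Rogghe style z-score (assumed formula, not confirmed for any tool here).

    c1: co-occurrences of node and collocate
    c2: corpus frequency of the collocate
    n: corpus frequency of the node
    span: total window size (e.g. 10 for 5 left + 5 right)
    corpus_size: total number of tokens
    """
    p = c2 / (corpus_size - n)      # chance of meeting the collocate at a random position
    expected = p * n * span         # expected co-occurrences inside all windows
    return (c1 - expected) / math.sqrt(expected * (1 - p))

# Figures quoted in the thread for 'everyone' + 'else'.
# 97_000_000 is a purely illustrative alternative corpus size.
for corpus_size in (100_000_000, 97_000_000):
    for span in (10, 6):
        z = z_score(1153, 19931, 12786, span, corpus_size)
        print(f"corpus_size={corpus_size} span={span} z={z:.1f}")
```

If CalcZ or SARA counts tokens differently, or treats the span differently, the same raw frequencies will produce different scores under this formula.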

Now question #2:
Quoting ibid (2005-10-9 14:15:35):
Besides, if I randomly select 500 concordance lines from the BNC, 'else' co-occurs 36 times with 'everyone', and when I calculate the Z-score within the downloaded lines only, I get 37.3. Why is it such a far cry from 237.1, the figure for the whole corpus?

Well, when you change the sample size, you get different frequency info and yet
you assume the same base numbers (corpus size, freq. of node, freq. of collocate,
etc.). The results are bound to be different. It's almost like asking: when you have
a total of 10 bananas, why do 100 monkeys each get fewer bananas than 10 monkeys
would, and why can't they all have the same number of bananas?
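
To put rough numbers on this, the sketch below keeps the co-occurrence rate fixed and varies only the number of node occurrences. It assumes the Berry-Rogghe style z-score, which may not be the formula either program uses; the names z_score and z_at_sample_size are hypothetical, and the idealised samples are not real BNC counts. Under that assumption the score grows roughly with the square root of the number of node occurrences, so a 500-line sample cannot be expected to reach the whole-corpus value.

```python
import math

def z_score(c1, c2, n, span, corpus_size):
    # Assumed Berry-Rogghe style formula; see the lead-in for caveats.
    p = c2 / (corpus_size - n)
    expected = p * n * span
    return (c1 - expected) / math.sqrt(expected * (1 - p))

# Whole-corpus figures quoted in the thread for 'everyone' + 'else'.
FULL_N, FULL_C1, C2, SPAN, CORPUS = 12786, 1153, 19931, 10, 100_000_000
RATE = FULL_C1 / FULL_N  # co-occurrences per node occurrence

def z_at_sample_size(n):
    """z-score for an idealised sample of n node occurrences with the same rate."""
    return z_score(RATE * n, C2, n, SPAN, CORPUS)

for n in (500, 2000, FULL_N):
    print(n, round(z_at_sample_size(n), 1))

# The ratio between two sample sizes is close to sqrt(n_large / n_small).
print(z_at_sample_size(FULL_N) / z_at_sample_size(500), math.sqrt(FULL_N / 500))
```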
The calculation of the Z-score needs five numbers; if any of them changes, the result will be different.
Tools like WordSmith 4.0, Tact 2.15 and PHC 1.02 define "word" differently, so even when users work with the identical corpus they tend to get different results.
Also, some tools use 4/4 as the span while others use 5/5, so the results differ.
Dr. Barlow thinks that one should pay attention to the ranks of the collocates of a given word rather than to the values themselves.
I used the above tools to study the same word, "provide", and the results were quite different.
Tools are tools; we should interpret the results according to linguistic facts.
Besides, the Z-score has its disadvantages, so combining the Z-score, T-score, MI score and other scores is of great importance.
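
As a rough illustration of looking at several scores side by side, the sketch below computes a z-score, a t-score and an MI score from the same observed and expected counts. The formulas are common textbook versions and are an assumption here: individual tools may compute the expected frequency differently, and the function name association_scores is hypothetical.

```python
import math

def association_scores(c1, c2, n, span, corpus_size):
    """Return z, t and MI for one node/collocate pair (assumed textbook formulas)."""
    p = c2 / (corpus_size - n)      # chance of the collocate at a random position
    expected = p * n * span         # expected co-occurrences inside all windows
    return {
        "z": (c1 - expected) / math.sqrt(expected * (1 - p)),
        "t": (c1 - expected) / math.sqrt(c1),
        "mi": math.log2(c1 / expected),
    }

# Figures quoted in this thread for 'everyone' + 'else', span 10.
scores = association_scores(1153, 19931, 12786, 10, 100_000_000)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

The three measures are on different scales, so they are best compared through the rankings they produce over a list of collocates rather than through their raw values.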

The frequencies of 'everyone' and 'else' match my results (BNC World), but the z-score is 301.37 for +/-3. When you create a collocate database with the default settings (+/-5 span), it actually shows collocates within a span of +/-3. Try your method with S=6 to see if your result is the same as mine. [Actually if you try +/-5, 'else' is not on the list at all.]

Many thanks to 动态语法 and xiaoz. To be honest, I don't have any documentation for CalcZ. My friend just sent me the software, and I've tried to google it but found no useful info. I've noticed there is a discussion about ACWT on this site; I'll take a look at that later. As for WordSmith, I have version 3.0, but it doesn't seem to be able to calculate the Z-score. Or maybe I just don't know how to do it.
My result is almost the same as yours: 299.9 with 1121 occurrences of "else". The other word forms, else' (twice), else- (once) and elses (3 times), are not taken into account.
Now for the next question: can I only set S=6? (Why don't I have an interface like the one you showed above? There are several items under "Query" at the top of the interface; I choose Collocation and calculate the collocates one by one.)
I am using BNCweb hosted at Zurich.

You can also try the BNC Online at the British Library (but that is the old version, a bit larger than the World Edition).


Quoting ibid (2005-10-9 14:19:56):
Another question: does anyone know how to save concordance lines in a Word file? At the moment I can only save the lines in XML format and then copy them into a Word file, but there are a lot of tags I need to get rid of.

It depends on what kind of concordancer you are using.
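
Whatever the concordancer, if the lines can be saved as XML, one generic option is to strip the tags with a short script and paste the plain text into Word. The sketch below is only an assumption about what the saved lines look like: the sample markup and the function name strip_tags are hypothetical, and real SARA output may need different handling.

```python
import html
import re

TAG = re.compile(r"<[^>]+>")  # matches anything between angle brackets

def strip_tags(line):
    """Remove markup tags, decode entities and normalise whitespace."""
    text = TAG.sub("", line)        # drop every tag, keep the text between tags
    text = html.unescape(text)      # turn entities such as &amp; back into characters
    return " ".join(text.split())   # collapse runs of spaces and newlines

# Hypothetical saved concordance line; real output may look different.
sample = ('<s n="1"><w c5="PNI">Everyone </w><w c5="AV0">else </w>'
          '<w c5="VVD">agreed</w><c c5="PUN">.</c></s>')
print(strip_tags(sample))
```

A regular expression is enough for flat concordance lines like this; for nested or irregular markup a proper XML parser would be the safer choice.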
My study follows my supervisor's criteria, but I guess he hasn't used the BNC in his own studies. He usually defines the span as +/-5 and sets the significance threshold for the
Z-score at 2.0. If I set S=6 for my study, what should the threshold be?