Do MonoConc Pro and WordSmith 4.0 produce the same statistics?


Title Corpus-Based Approaches and Discourse Analysis in Relation to Reduplication and Repetition
Author Wang, Shih-ping
Source Journal of Pragmatics, 2005, 37, 4, Apr, 505-540

the concordancing software packages, MonoConc Pro and WordSmith 4.0, were employed to generate frequency lists, concordances and collocation information. The built-in SARA software in the BNC was used to calculate mutual information scores for the probability of collocation. (p. 518)

Both MonoConc Pro and WordSmith 4.0 may generate slightly different statistical results for the same file when running their programming function, which will be discussed elsewhere. (p. 528)

Could forum members who have both MonoConc Pro and WordSmith 4.0 on hand please try this out?

Concordancing oneself: Constructing individual textual profiles
David Coniam
International Journal of Corpus Linguistics 9:2 (2004) 271-298

Oakes (1998:28–29) quoting an example by Kilgarriff (1997) on the use of chi-square
(one of the statistics reported by WordSmith Tools) states that as sample size
(i.e., corpus size) increases, all chi-square tests will indicate significant difference.
With corpora getting bigger and bigger – and expected to be bigger and
bigger – even a personal corpus of 300,000 tokens / 9,000 types is beyond the
bounds of such simple statistical analysis. It is not really worthwhile, therefore,
performing statistical tests such as chi-square, log-likelihood or correlations.

Kilgarriff (2001) discusses the need for a metric to be able reliably to compare

Kilgarriff, A. (1997). Using Word Frequency Lists to Measure Corpus Homogeneity and
Similarity Between Corpora. Proceedings of the Fifth ACL Workshop on Very Large
Corpora, Beijing and Hong Kong, August 1997.
Kilgarriff, A. (2001). Comparing Corpora. International Journal of Corpus Linguistics, 6 (1),
Oakes, M. P. (1998). Statistics for Corpus Linguistics. Edinburgh: Edinburgh University Press.

Does this mean that when WordSmith is used to compute statistics on a corpus such as the BNC, its correlation analysis cannot be used at all? I have only a superficial grasp of statistics, so I would welcome everyone's views.
It is very likely that different programs (as I found with WordSmith 4, Xaira, A Corpus Worker's Tool and MS Word) return different numbers of matches, especially when the search string is complex and the corpus is annotated, because different programs may use different search algorithms.

What is important is that you should stick to one and the same tool in one study to ensure comparability.
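
As a rough illustration of why counts can differ between programs, the sketch below tokenizes the same sentence under two different rules. Both rules, and the sample sentence, are hypothetical and chosen only for illustration; they are not the actual algorithms of WordSmith, Xaira, MonoConc Pro or any other tool.

```python
import re

# Hypothetical sample sentence with hyphens, apostrophes and abbreviations.
text = "The so-called 'state-of-the-art' tool doesn't count U.K. tokens the same way."

def tokens_whitespace(s):
    # Rule A (assumption): a token is anything between whitespace.
    return s.split()

def tokens_alpha(s):
    # Rule B (assumption): a token is any unbroken run of letters,
    # so hyphens, apostrophes and full stops all split words.
    return re.findall(r"[A-Za-z]+", s)

count_a = len(tokens_whitespace(text))
count_b = len(tokens_alpha(text))
print("Rule A tokens:", count_a)
print("Rule B tokens:", count_b)
```

The two rules disagree on the token total for the same text, and any statistic normalised by corpus size inherits that disagreement, which is one reason to keep to a single tool within a study.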
The significance levels returned by statistical tests such as chi-square and log-likelihood are closely related to the sample sizes. That's why I have advised in Corpus-Based Language Studies not to artificially inflate or deflate the common base for normalisation when comparing corpora.

However, before more scientific metrics have been developed, corpus linguists have either to live with these standard tests or to leave the profession.
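
The dependence of significance on sample size can be seen in a small sketch. It computes a Pearson chi-square (no continuity correction) for one word in two corpora, keeping the relative frequencies fixed while multiplying both corpus sizes. The frequencies (100 vs. 110 per 100,000 tokens) are invented for illustration, and `word_vs_rest` is a hypothetical helper name.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def word_vs_rest(freq1, size1, freq2, size2):
    # Rows: corpus 1 and corpus 2. Columns: the word, and all other tokens.
    return chi_square_2x2(freq1, size1 - freq1, freq2, size2 - freq2)

CRITICAL_05 = 3.841  # chi-square critical value for df = 1 at p = 0.05

results = []
for scale in (1, 10, 100):
    # Same relative frequencies every time; only the corpus size changes.
    x2 = word_vs_rest(100 * scale, 100_000 * scale, 110 * scale, 100_000 * scale)
    results.append(x2)
    verdict = "significant" if x2 > CRITICAL_05 else "not significant"
    print(f"corpus size {100_000 * scale:>10,}: chi-square = {x2:.3f} ({verdict})")
```

Because the statistic grows in direct proportion to the scale factor, the same 10% difference in relative frequency is not significant in the smallest pair of corpora but becomes significant once the corpora are ten times larger.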
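Thanks, Dr. Xiao. I fully agree with the first point: results are comparable when the same tool and the same corpus are used. As for the second point, does it mean that this is a matter for statisticians? Can we do nothing?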
Not necessarily a statistician's task. You can of course propose a statistical formula to be tested if you like.

When a statistical formula is found to be inadequate for a particular purpose, a new one is developed and used. For example, when parametric t-tests are found to rely upon normal distribution, people start to use log-likelihood tests; when chi-square or log-likelihood tests are found to be unreliable with low frequencies, people start to use Fisher's Exact Test; when MI and z-scores for collocation statistics are found to unduly overemphasize infrequent items, MI3 (cubic MI) and log-log measures are developed and used, or a minimum co-occurrence frequency is specified to reduce such an undesirable effect.
A few more questions about terminology: do the terms cluster, chunk, lexical bundle and n-gram all refer to the same thing? I have limited materials on hand.
You can search for n-gram, lexical bundle, etc. on this site. There have been some discussions in this area.
Help needed: using Professor Barlow's software Mono Pro 2.2, how can I search XML-formatted text for adjectives premodifying nouns? For example, how can I retrieve <w POS="JJ">Superior</w> <w POS="NN">Court</w> from the sample below? How should the tag settings be configured, and what should the search term be in the tag search? Thanks!
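
The point about MI overemphasizing infrequent items can be sketched with two invented word pairs. The sketch assumes the simplified forms MI = log2(O/E) and MI3 = log2(O³/E), with expected frequency E = f1 × f2 / N and no adjustment for span size; the counts and the corpus size are made up for illustration.

```python
import math

N = 1_000_000  # assumed corpus size in tokens

def expected(f_node, f_colloc, n=N):
    # Expected co-occurrence frequency under independence (span size ignored).
    return f_node * f_colloc / n

def mi(observed, f_node, f_colloc):
    return math.log2(observed / expected(f_node, f_colloc))

def mi3(observed, f_node, f_colloc):
    # Cubing the observed frequency gives more weight to frequent pairs.
    return math.log2(observed ** 3 / expected(f_node, f_colloc))

# (observed co-occurrences, node frequency, collocate frequency), all invented.
rare_pair = (2, 2, 2)
frequent_pair = (500, 1000, 2000)

for name, pair in (("rare", rare_pair), ("frequent", frequent_pair)):
    print(f"{name:>8}: MI = {mi(*pair):.2f}, MI3 = {mi3(*pair):.2f}")
```

Under these assumptions MI ranks the pair that occurs only twice far above the pair that co-occurs 500 times, while MI3 reverses the ranking. A minimum co-occurrence threshold would simply drop the rare pair before scoring.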

SA01:3 <w POS="AT">the</w> <w POS="NP">September-October</w> <w POS="NN">term</w> <w POS="NN">jury</w> <w POS="HVD">had</w> <w POS="BEN">been</w> <w POS="VBN">charged</w> <w POS="IN">by</w> <w POS="NP">Fulton</w> <w POS="JJ">Superior</w> <w POS="NN">Court</w> <w POS="NN">Judge</w> <w POS="NP">Durwood</w> <w POS="NP">Pye</w> <w POS="TO">to</w> <w POS="VB">investigate</w> <w POS="NNS">reports</w> <w POS="IN">of</w>
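I tried it: AntConc retrieves these correctly and the results are accurate, but searching with WST 3.0 or WST 4.0 gives incorrect results. I do not know how to configure Professor Barlow's software, although the Collocate software can compute t-scores, MI scores and LL scores. I hope forum members familiar with Professor Barlow's software can help solve this problem.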
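
For readers who only need the matches and not a particular concordancer, the adjective-plus-noun pattern can be sketched as a plain regular expression in Python. This is not MonoConc Pro's tag-search syntax, which is left unanswered here; the pattern and the shortened excerpt of the sample are assumptions for illustration.

```python
import re

# Shortened excerpt of the tagged sample quoted above.
sample = (
    '<w POS="NP">Fulton</w> <w POS="JJ">Superior</w> '
    '<w POS="NN">Court</w> <w POS="NN">Judge</w> <w POS="NP">Durwood</w>'
)

# An adjective (JJ) immediately followed by a noun whose tag starts with NN
# (so NN and NNS both match). Group 1 is the adjective, group 2 the noun.
pattern = re.compile(
    r'<w POS="JJ">([^<]+)</w>\s+<w POS="NN[^"]*">([^<]+)</w>'
)

pairs = pattern.findall(sample)
for adjective, noun in pairs:
    print(adjective, noun)
```

The same expression should also work in any tool that accepts regular expressions over the raw tagged text, provided the tool does not strip or hide the tags before searching.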