What corpora cannot do? Ten Commandments for Corpus Linguists

xujiajin

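I have been thinking about this question ever since I first started working with corpora. As corpora and corpus methods have become so popular in language research, they have taken on a certain "who else but us" air. So I am raising the question here to discuss the shortcomings of corpus linguistics with everyone.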

What can corpora not solve, or not solve well?


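PS: Today I read an article titled "Today's corpus linguistics: Some open questions". At the end of the article, the author gives a set of "Ten Commandments" for corpus linguists. They are quite interesting, so I attach them here for discussion.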
Ten Commandments for computational (and corpus) linguists
(1) Garbage in, garbage out.
This is an old, general and, hopefully, broadly accepted experience and reinterpreted knowledge, having, in the case of corpora, a number of implications related to data and their treatment.
(2) The more data the better. But there is never enough data to help solve everything. This points to a need for more and also alternative resources, such as spoken data.
(3) The best information comes from direct data.
This points to the alternative of pure unadulterated texts, devoid of any annotation. Yet we still do not know how to properly handle them. The other alternative, annotation, now offered as the solution, is, to a varying extent, always biased and adulterates both the data input and results obtained. Hence, it should always be viewed as an alternative only.
(4) There is no all-embracing algorithm that is universal in its field and transferable to all other languages.
(5) Lemmatizers have invented imaginary new worlds, often creating nonexistent entities (forms) and suggesting false ones.
(6) It is not all research that glitters statistically.
(7) Language is both regular and irregular, not everything may be captured by algorithms automatically.
This points to the very much-neglected field of idioms, mostly, and grey zones of metaphoricity.
(8) The main goal of language is to code and decode meaning. Since meaning is not limited to words only, it is wrong to concentrate on words only. This point, often raised also by J. Sinclair, refers, among other things, to multiword lexemes and problems of compositionality of meaning. As yet, no reliable and general techniques for handling this are available.
(9) There are no aligners that will do the job for you automatically. 99% of this has to be done manually anyway.
(10) It is high time to ask computational linguists what their theories and programmes cannot do, how much of the field goes by the board and is never mentioned. Their alleged comprehensive coverage may be deceptive.

Reference
Čermák, František. 2002. Today's corpus linguistics: Some open questions. International Journal of Corpus Linguistics 7(2): 265-282.
 
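This is very meaningful. Only by knowing our own weaknesses can we effectively avoid them in our research and be prepared for certain criticisms.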
 
"Conclusions about language drawn from a particular corpus have to be treated as deductions, not as facts." - Hunston (2002: 23).
 
Re: What corpora cannot do? Ten Commandments for Corpus Linguists

I'd like to see corpus linguistics explored to its fullest potential
before we worry about its limitations.

 
Re: What corpora cannot do? Ten Commandments for Corpus Linguists

Quoting 动态语法 (2005-7-12 13:34:42):
I'd like to see corpus linguistics explored to its fullest potential
before we worry about its limitations.

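Agreed. During the periods of its first and second models, Chomsky's transformational-generative grammar did not give much thought to limitations either. It developed quickly and had great explanatory power, and only later were various constraints added.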
 
Quoting tiger (2005-7-13 9:48:44):
Language is too complex and complicated, so I always doubt the results of my searches.

One thing corpus linguists like to boast about is that results from corpus data very often go against our intuitive understanding of language.
 
Intuitions are not always reliable. They can be biased by the idiolects and sociolects of individuals.
 
All of you have a point, but the most important thing is to go ahead. The more we discover, the better we can deal with the problems.
 
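Someone asked for this article, so I am bumping the thread to bring it back to the top. In future you can try using search, but you will need to remember the thread title.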
 
There should be the Eleventh Commandment: The best supporting evidence you can expect to get from your corpus only tells half of the story. The other half is revealed by means of asterisked examples, which, by definition, are not attested anywhere.
 
Re: What corpora cannot do?

Quoting PTCP (2005-8-31 10:59:04):
There should be the Eleventh Commandment: The best supporting evidence you can expect to get from your corpus only tells half of the story. The other half is revealed by means of asterisked examples, which, by definition, are not attested anywhere.

Unfortunately, in the linguistics field much attention has been paid to
asterisked examples rather than attested uses of language, and many
practitioners believe that by doing it this way they can get the whole truth.
 
Starred examples are far less important in corpus-based studies than in traditional linguistic analyses. You will find that most corpus studies either use no starred examples at all or use them in contrast to attested corpus examples.
 
Re: What corpora cannot do? Ten Commandments for Corpus Linguists

Quoting xiaoz (2005-8-31 20:28:54):
Starred examples are far less important in corpus-based studies than in traditional linguistic analyses. You will find that most corpus studies either use no starred examples at all or use them in contrast to attested corpus examples.


Isn't that exactly what I meant?
 
So we agree on this point. It might be the half-and-half split that gave me the wrong impression, namely that a linguistic analysis without starred examples is only halfway to its destination, which I do not think is true.
 
Re: What corpora cannot do? Ten Commandments for Corpus Linguists

Dr. Xu, this topic is very interesting. Is the full text of the article available? Thank you!
 