What can't corpora do? Ten Commandments for Corpus Linguists
I have been thinking about this question ever since I first started working with corpora. As corpora and corpus methods have come to dominate language research, they sometimes carry an air of being the only game in town. So I raise the question here in order to explore, together with all of you, the limitations of corpus linguistics.
What is it that corpora cannot solve, or cannot solve well?
PS: Today I read an article entitled Today's corpus linguistics: Some open questions. At the end of the article, the author offers a set of "Ten Commandments for corpus linguists", which I found very interesting, so I append it here for discussion.
Ten Commandments for computational (and corpus) linguists
(1) Garbage in, garbage out.
This is an old, general and, one hopes, broadly accepted piece of experience and accumulated knowledge, which in the case of corpora has a number of implications for the data and their treatment.
(2) The more data the better. But there is never enough data to solve everything. This points to a need for more, and also for alternative, resources such as spoken data.
(3) The best information comes from direct data.
This points to the alternative of pure unadulterated texts, devoid of any annotation. Yet we still do not know how to properly handle them. The other alternative, annotation, now offered as the solution, is, to a varying extent, always biased and adulterates both the data input and results obtained. Hence, it should always be viewed as an alternative only.
(4) There is no all-embracing algorithm that is universal in its field and transferable to all other languages.
(5) Lemmatizers have invented imaginary new worlds, often creating nonexistent entities (forms) and suggesting false ones (see the sketch after this list).
(6) It is not all research that glitters statistically.
(7) Language is both regular and irregular; not everything can be captured by algorithms automatically.
This points mostly to the much-neglected field of idioms, and to the grey zones of metaphoricity.
(8) The main goal of language is to code and decode meaning. Since meaning is not limited to words only, it is wrong to concentrate on words only. This point, often raised also by J. Sinclair, refers, among other things, to multiword lexemes and problems of compositionality of meaning. As yet, no reliable and general techniques for handling this are available.
(9) There are no aligners that will do the job for you automatically. 99% of this has to be done manually anyway.
(10) It is high time to ask computational linguists what their theories and programmes cannot do, how much of the field goes by the board and is never mentioned. Their alleged comprehensive coverage may be deceptive.
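Commandment (5) is easy to make concrete. The sketch below is purely illustrative and uses a toy rule set of my own invention, not the behaviour of any real lemmatizer: a suffix-stripper with no dictionary lookup and no part-of-speech context will happily produce lemmas that do not exist in English and conflate unrelated words.

```python
# Illustrative only: a toy suffix-stripping "lemmatizer" (hypothetical rules,
# not any real tool) showing how blind normalization invents nonexistent forms.

SUFFIX_RULES = [
    ("ies", "y"),  # flies -> fly (correct)
    ("ing", ""),   # walking -> walk (correct), but bring -> br (nonexistent)
    ("ed", ""),    # walked -> walk (correct)
    ("s", ""),     # cats -> cat (correct), but was -> wa (nonexistent)
]

def naive_lemmatize(token: str) -> str:
    """Strip the first matching suffix; no dictionary check, no POS context."""
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 2:
            return token[: -len(suffix)] + replacement
    return token

if __name__ == "__main__":
    for word in ["flies", "walked", "bring", "was", "news", "this"]:
        print(f"{word:>8} -> {naive_lemmatize(word)}")
    # flies -> fly and walked -> walk are fine, but we also get bring -> br,
    # was -> wa, this -> thi (invented forms) and news -> new (a false
    # conflation) -- errors that then propagate silently into frequency
    # lists and concordances.
```

Real lemmatizers are of course far more sophisticated, but the failure mode is the same in kind: wherever the rules or the training data run out, the tool silently hands the corpus user forms the language never produced.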
Reference
Čermák, František. 2002. Today's corpus linguistics: Some open questions. International Journal of Corpus Linguistics 7(2): 265–282.