What is not a corpus?-Sinclair

xujiajin

Administrator
Staff member
What is not a corpus?
As we move towards a definition of a corpus, we remind ourselves of some of the things that a corpus might be confused with, because there are many collections of language text that are nothing like corpora.

The World Wide Web is not a corpus, because its dimensions are unknown and constantly changing, and because it has not been designed from a linguistic perspective. At present it is quite mysterious, because the search engines, through which the retrieval programs operate, are all different, none of them are comprehensive, and it is not at all clear what population is being sampled. Nevertheless, the WWW is a remarkable new resource for any worker in language (see Appendix), and we will come to understand how to make best use of it.

An archive is not a corpus. Here the main difference is the reason for gathering the texts, which leads to quite different priorities in the gathering of information about the individual texts.

A collection of citations is not a corpus. A citation is a short quotation which contains a word or phrase that is the reason for its selection. Hence it is obviously the result of applying internal criteria. Citations also lack the textual continuity and anonymity that characterise instances taken from a corpus; the precise location of a quotation is not important information for a corpus researcher.

A collection of quotations is not a corpus for much the same reasons as a collection of citations; a quotation is a short selection from a text, chosen on internal criteria and chosen by human beings and not machines.

These last two collections correspond more closely to a concordance than a corpus. A concordance also consists of short extracts from a corpus, but the extracts are chosen by a computer program, and are not subject to human intervention in the first instance. Also the constituents of a corpus are known, and searches are comprehensive and unbiased. Some collections of citations or quotations may share some or all of these criteria, but there is no requirement for them to adopt such constraints. A corpus researcher has no choice, because he or she is committed to acquire information by indirectly searching the corpus, large or small.
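To make the contrast concrete, here is a minimal sketch of a concordancer, assuming the corpus is simply a list of strings held in memory. The function name `concordance`, the `width` parameter and the toy corpus are all invented for illustration and do not come from Sinclair's text. The point it illustrates is that the program returns every match it finds, with no human choosing which extracts to keep.

```python
import re

def concordance(texts, keyword, width=30):
    """Return a keyword-in-context line for EVERY occurrence of keyword.

    Hypothetical helper: nothing is hand-picked, which is what makes the
    search comprehensive over the texts it is given.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            # Up to `width` characters of context on each side of the hit
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Toy corpus, made up for this example only
toy_corpus = [
    "A corpus is designed. A text is not a corpus.",
    "An archive is not a corpus either.",
]

for line in concordance(toy_corpus, "corpus"):
    print(line)
```

A collection of citations would correspond to a person keeping only the lines they found interesting; the sketch keeps all of them, in the order they occur.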

A text is not a corpus. The main difference (Tognini Bonelli 2001 p.3) is the dimensional one explained above. Considering a short stretch of language as part of a text is to examine its particular contribution to the meaning of the text, including its position in the text and the details of meaning that come from this unique event. If the same stretch of language is considered as part of a corpus, the focus is on its contribution to the generalisations that illuminate the nature and structure of the language as a whole, far removed from the individuality of utterance.
 
Sinclair 2005. 'Corpus and Text - Basic Principles'
In M. Wynne ed. Developing Linguistic Corpora: a Guide to Good Practice. Oxford: AHDS.

http://www.corpus4u.com/forum_view.asp?forum_id=60&view_id=633

Why don't you read it yourself?
 
In "What is not a corpus?" Sinclair says the World Wide Web is not a corpus, because its dimensions are unknown and constantly changing, and because it has not been designed from a linguistic perspective. At present it is quite mysterious, because the search engines, through which the retrieval programs operate, are all different, none of them are comprehensive, and it is not at all clear what population is being sampled. Nevertheless, the WWW is a remarkable new resource for any worker in language.
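However, in the article "Google: An Important Supplement to Collocation Dictionaries" (外语电化教学, issue 99, October 2004), the author, Pan Jiayun (潘家云), argues that the Internet is a very large corpus with a huge collocation dictionary hidden inside it. Google can supplement dictionaries by turning up more "hot and fresh words and phrases in use rather than in the brain".
So, following Sinclair's explanation, the living word and phrase collocations retrieved from the boundlessly large Internet look "quite mysterious", because the search engines were not designed from a linguistic perspective and the sampling is unscientific. Should we take this to mean that these collocations are unscientific and cannot guide our foreign language teaching?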
 
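If you find an interesting new usage on Google, can you be sure it was spoken or written by a native speaker of English? That is hard to say. And even if it was, if only a handful of people use it, for example members of a particular ethnic group, or if it is one writer's personal coinage, can we use it to teach students?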

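So Google is a very powerful tool, but what it returns still needs to be screened. For all you know, the English that turns up on Google might be something you wrote yourself. It is all very hard to say.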
 