[Original] Can corpora contribute to linguistic theory?

xiaoz

58. Theory-driven corpus research

Richard Xiao

[To appear in 2006. In A. Lüdeling, M. Kytö & A. McEnery (eds) Handbooks of Linguistics and Communication Science Volume Corpus Linguistics. Berlin: Mouton de Gruyter]

58.1. Introduction
The theory-driven versus data-driven distinction in linguistics is a manifestation of the conflict between rationalism and empiricism in philosophy. The extreme versions of these two approaches to linguistics are vividly illustrated by Fillmore’s (1992) cartoon figures of the armchair linguist and the corpus linguist. The armchair linguist thinks that what the corpus linguist is doing is uninteresting, while the corpus linguist believes that what the armchair linguist is doing is untrue. It is hardly surprising that the divorce of theory and empirical data results in either untrue or uninteresting theories, because any theory that cannot account for authentic data is a false theory while data without a theory is just a pile of data. As such, with the exception of a few extremists from either camp who argue that “Corpus linguistics doesn’t mean anything” (cf. Andor 2004, 97), or that nothing meaningful can be done without a corpus (cf. Murison-Bowie 1996, 182), the majority of linguists (e.g. Leech 1992; Meyer 2002) are aware that the two approaches are complementary to each other. In Fillmore’s (1992) words, “the two kinds of linguists need each other. Or better, […] the two kinds of linguists, wherever possible, should exist in the same body.”
This chapter discusses the use of corpus data in developing linguistic theories and presents an effort to achieve a marriage between theory-driven and corpus-based approaches to linguistics via a case study of aspect, which has long been studied, but rarely with recourse to corpus data.

58.2. Can corpora contribute to linguistic theories?
To answer this question, we must first of all find out what linguistics is. We will then discuss the nature of data used in linguistics and explore how corpus data can contribute to linguistic theories.

58.2.1. What linguistics is about
It has been argued that linguistics is “the study of abstract systems of knowledge idealized out of language as actually experienced”, i.e. “idealized internalized I-language” (Widdowson 2000, 6). If linguistics is defined in this way, we must admit that any linguistic analysis involving performance data (E-language) has nothing to do with “linguistics” and should claim no place in “linguistics” at all (cf. Leech 2000, 685). The assumption underlying Widdowson’s definition is Chomsky’s (1965, 1986) claim that competence can be separated from performance and studied on its own. But can it?
The competence vs. performance divide is rooted in the hypothesis that grammar is autonomous within the human mind. Generative grammarians argue that our use of language (performance, E-language) cannot reflect our internal knowledge of language (competence, I-language), because of the constraints in naturally occurring language. Performance errors have been likened to abnormal conditions like tiredness and drunkenness in human communication (e.g. Radford 1997, 2). Only the internal grammar, which is based on native intuition and not polluted by performance constraints, is said to be part of competence. The corollary of this argument is the sharp distinction between langue and parole (Saussure 1966), between performance and competence (Chomsky 1965), and between grammar and usage (Newmeyer 2003). Nevertheless, evidence from recent research in psycholinguistics, neurolinguistics, and biology shows that the hypothesis of autonomous grammar, which underlies the competence vs. performance dichotomy, is unsustainable (see Shei 2004 for a review). Rather, grammar is constantly shaped by culture (or environmental factors) and interpersonal interactions. In Beaugrande’s (1997) words, “performance can crucially determine the development and quality of competence.” On the other hand, performance does not spring from nowhere: it is a natural and actual product of competence. As such, as Leech (1992, 108) observes, “the putative gulf between competence and performance has been overemphasized.”
Given the nature of this interdependence, the Chomskyan linguists’ practice of separating competence from performance is simply misleading in that it is in essence merely an “idealization of language for the sake of simplicity” (Abney 1996). In doing so, real language is replaced by idealized language which does not exist but which purports to sustain an explanation of language (see Beaugrande 1997). In the dialectic view of the relationship between competence and performance, therefore, the assertion that performance “cannot constitute the subject-matter of linguistics” (Chomsky 1965, 20) is simply unsustainable, because competence is not directly accessible and our only gateway to it is through performance (cf. Meyer/Nelson 2005). Linguistics is in fact concerned with what language really is, as reflected by our knowledge, as well as use, of language.

58.2.2. Data used in linguistics
Broadly, there are three types of data that can be used in linguistic analysis: introspective data, elicited data, and corpus data (cf. Meyer/Nelson 2005). Introspection is a process in which the linguist uses his or her own intuition to invent examples for linguistic analysis or make acceptability judgments. Introspection is always useful in linguistics as the linguist can invent purer examples instantly for analysis. This is so because intuition is readily available and invented examples are free from language-external influences existing in naturally occurring language. Nevertheless, intuition should be applied with caution (cf. Seuren 1998, 260-262). Firstly, it is possible to be influenced by one’s dialect or sociolect; what appears unacceptable to one speaker may be perfectly felicitous to another. Secondly, when one invents an example to support or disprove an argument, one is consciously monitoring one’s language production. Therefore, even if one’s intuition is correct, the utterance may not represent typical language use. Thirdly, introspective data is decontextualized because it exists in the analyst’s mind rather than in any real linguistic context. Context is particularly relevant to acceptability and grammaticality judgments. With proper contexts, what might appear ungrammatical or unacceptable out of context can become grammatical and acceptable. Fourthly, results based on introspection alone are difficult to verify as introspection is not observable. Finally, excessive reliance on introspection blinds the analyst to the realities of language usage (cf. Meyer/Nelson 2005). As such, linguistic theories based on introspection alone can reflect nothing more than the idiolects of individual analysts.
In contrast with introspective data which relies solely on one’s own intuition, elicitation makes use of other people’s intuitions. Data collected through elicitation represents the intuitions of a group of informants. In general, elicited data is more reliable than introspective data, because it pulls together the intuitions of more than one speaker. However, the reliability of elicited data depends heavily upon the design of an experiment. A range of factors other than informants’ intuitions can affect the results of elicitation, which include, for example, whether an experiment is conducted in written or spoken form, whether the experimenter is present or absent in the elicitation test, whether informants are linguists or non-linguists, and what specific format the elicitation test takes (e.g. by asking informants to make either-or or scalar judgments, to rewrite sentences, or to fill gaps). Like introspective data, elicited data is decontextualized even though context is of crucial importance to the interpretation of the data in many areas of linguistic research.
It is clear from the above discussion that data collected through introspection and elicitation is artificial, as it is subject to conscious and subconscious monitoring on the part of the analyst or informants in the process of introspection or elicitation. In contrast, corpus data is natural, because this type of data is not created or elicited specifically for linguistic analysis. Rather, a corpus comprises samples of written/spoken language which has already occurred naturally in real linguistic context. As people speak and write on the basis of their intuitions in real contexts, corpus data is also intuition-based. But in relation to data collected through introspection and elicitation, a corpus typically reflects the intuitions of a much greater number of language users. A corpus also has the advantages of scale and variety (cf. Meyer/Nelson 2005).
It has been argued that corpus data should not be used in developing linguistic theory because a corpus, however large, is finite in size, is skewed in nature, and is likely to contain performance errors which have nothing to do with competence. It is true that most corpora, typically sample corpora, are finite in size. Unless you are studying a dead language or highly specialized sublanguage, it is virtually impossible to include every utterance or sentence of a given language in a corpus. However, the same can be said of introspective and elicited data. Furthermore, the finite size of a corpus can be viewed as a disadvantage or an advantage, depending upon one’s point of view. Given the elusiveness of language and the gradience in grammar (see Keller 2000), there has been an increasing consensus that quantification is indispensable in linguistic analysis. A corpus can offer quantitative data exactly because it has a finite size (a monitor corpus also has a finite size at a particular point in time).
Closely related to the finite size is the alleged skewedness. Corpora, especially those used in what McEnery/Wilson (2001) call “early corpus linguistics”, are ready targets of this criticism because of their small sizes and inadequate sampling. But with developments in technology, and especially the development of ever more powerful computers offering ever increasing processing power and massive storage at relatively low cost, the exploitation of massive corpora has become feasible. Nowadays, corpora built for linguistic analyses have also become more representative through rigorous sampling. While it might be true that a 100-million word balanced corpus is still skewed to some extent, it is certainly less skewed than a dataset obtained through introspection on the basis of one analyst’s intuition or through elicitation on the basis of the intuitions of a small number of informants. Corpora have been criticized for being skewed simply because they are observable and open to scrutiny whereas introspection and elicitation are not.
It is also true that corpus data may contain performance errors. But assuming that what we see in a corpus is largely grammatical and/or acceptable, the corpus at least provides evidence of what speakers believe to be acceptable utterances in their language, typically free of the overt judgment of others. Furthermore, as a corpus presents data in context, it allows for research into what types of performance errors occur under what conditions and are typically associated with what contexts. Theories of this type cannot be developed on the basis of decontextualized data though they are of practical importance in linguistics. In our view, therefore, a “performance grammar” (Chomsky 1962, 537-538) that copes with regular and irregular language phenomena (including performance errors) is of greater importance than a “competence grammar” that has little bearing on “everyday production or comprehension of language” (Schütze 1996, xi). It is simply a vicious circle to develop a linguistic hypothesis on the basis of an analyst’s introspective data, which is used again to verify the same hypothesis.
In spite of the recurrent criticisms of corpus data by Chomskyan linguists, corpus linguistics has won widespread popularity and has been used in nearly all branches of linguistics (see McEnery/Xiao/Tono 2005). A corpus typically provides data that is attested, contextualized and quantitative. An additional advantage in using corpus data is that a corpus can find differences that intuition alone cannot perceive (cf. Francis/Hunston/Manning 1996). Broadly speaking, compared with the more traditional introspection-based approach, which rejected or ignored corpus data, the corpus-based approach can achieve improved reliability because it does not go to the extreme of rejecting intuition while attaching importance to empirical data. The key to using corpus data is to find the balance between the use of corpus data and the use of one’s intuition. As Leech (1991, 14) observes:
Neither the corpus linguist of the 1950s, who rejected intuition, nor the general linguist of the 1960s, who rejected corpus data, was able to achieve the interaction of data coverage and the insight that characterize the many successful corpus analyses of recent years.

58.2.3. Corpus-based versus corpus-driven linguistics
Whether corpora should be used at all in linguistics is one issue, and how corpora should be used is another. Having established that corpus data should form the basis of linguistic theories, this section discusses how corpora can contribute to linguistics. Even among those who advocate the use of corpus data, there are different opinions and different approaches. One area where opinions diverge in corpus linguistics is the distinction between corpus-based and corpus-driven approaches.
In the corpus-based approach, it is said that corpora are used mainly to “expound, test or exemplify theories and descriptions that were formulated before large corpora became available to inform language study” (Tognini-Bonelli 2001, 65). Corpus-based linguists are accused of not being fully and strictly committed to corpus data as a whole as they have been said to discard inconvenient evidence (i.e. data not fitting the pre-corpus theory) by “insulation”, “standardization” and “instantiation”, typically by means of annotating a corpus. In contrast, corpus-driven linguists are said to be strictly committed to “the integrity of the data as a whole” (ibid., 84) and therefore, in this latter approach, it is claimed that “[t]he theoretical statements are fully consistent with, and reflect directly, the evidence provided by the corpus” (ibid., 85). Upon interrogating the available evidence, nevertheless, it is found that the distinction between the corpus-based and corpus-driven approaches is overstated and that this latter approach is an idealized extreme. There are three basic differences between the two approaches: the types of corpora used, attitudes towards existing theories and intuitions, and focuses of research. Let us discuss each in turn.
Regarding the type of corpus data used, there are three issues: representativeness, corpus size and annotation. Let us consider these in turn. According to corpus-driven linguists, there is no need to make any serious effort to achieve corpus balance and representativeness because the corpus is said to balance itself when it grows big enough, achieving so-called cumulative representativeness. This initial assumption of self-balancing via cumulative representativeness, nonetheless, is arguably unwarranted. For example, one such cumulatively representative corpus is the corpus of Zimbabwean English that Louw (1991) used in his contrastive study of collocations in British English and Zimbabwean English. This study shows that the collocates of wash and washing, etc. in British English are machine, powder and spin whereas in Zimbabwean English the more likely collocates are women, river, earth and stone. The different collocational behaviors were attributed to the fact that the Zimbabwean corpus has a prominent element of literary texts such as Charles Mungoshi’s novel Waiting for the Rain, “where women washing in the river are a recurrent theme across the novel” (Tognini-Bonelli 2001, 88). One could therefore reasonably argue that this so-called cumulatively balanced corpus was skewed. Especially where whole texts are included, a practice corpus-driven linguists advocate, it is nearly unavoidable that a small number of texts may seriously affect, either by theme or in style, the balance of a corpus. Findings on the basis of such cumulatively representative corpora may not be generalisable beyond the corpora themselves as their representativeness is highly idiosyncratic.
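The point about a few texts dominating a corpus can be checked mechanically. The following is only a minimal sketch, not a method taken from the studies cited: it assumes a corpus stored as a mapping from text name to raw text, uses naive whitespace tokenization, and the function name collocate_share_by_text is hypothetical. It reports what share of a node-collocate pairing each text contributes.

```python
from collections import Counter


def collocate_share_by_text(texts, node, collocate, span=4):
    """Share of node/collocate co-occurrences contributed by each text.

    texts: dict mapping a text name to its raw text (an assumed layout).
    A co-occurrence is counted whenever `collocate` appears within
    `span` tokens to the left or right of an occurrence of `node`.
    """
    counts = Counter()
    for name, text in texts.items():
        tokens = text.lower().split()  # naive tokenization, illustration only
        for i, tok in enumerate(tokens):
            if tok == node:
                window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
                counts[name] += window.count(collocate)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {name: c / total for name, c in counts.items()}
```

If a single text accounts for most of a pairing, as with the novel in the wash example above, the collocation says more about that text than about the variety as a whole.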
The corpus-driven approach also argues for very large corpora. While it is true that the corpora used by corpus-driven linguists are very large (for example, the Bank of English has grown to 524 million words), size is not all-important, as Leech (1991, 8-29) notes. Another problem for the corpus-driven approach relates to frequency. While it has been claimed that in the corpus-driven approach corpus evidence is exploited fully, in reality frequency may be used as a filter to allow the analyst to exclude some data from their analysis. For example, a researcher may set a minimum frequency of occurrence which a pattern must reach before it merits attention, e.g. it must occur at least twice, in separate documents (Tognini-Bonelli 2001, 89). Even with such a filter, a corpus-driven grammar would consist of thousands of patterns which would bewilder the learner. It is presumably to avoid such bewilderment that the patterns reported in the Grammar Patterns series (Francis/Hunston/Manning 1996; 1998), which are considered the first results of the corpus-driven approach, are not even that exhaustive. Indeed, faced with the great number of concordances, corpus-driven linguists are often found to analyze only every nth occurrence from a total of X instances. This is in reality currently the most practical way of exploring a very large unannotated corpus. Yet if a large corpus is reduced to a small dataset in this way, there is little advantage in using very large corpora and it can hardly be claimed that corpus data is exploited fully and the integrity of the data is respected. It appears, then, that the corpus-driven approach is not so different from the corpus-based approach: while the latter allegedly insulates theory from data or standardizes data to fit theory, the former filters the data via apparently scientific random sampling, though there is no guarantee that the corpus is not explored selectively to avoid inconvenient evidence.
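The two data-reduction steps just described, a minimum-frequency filter and the selection of every nth concordance line, can be sketched as follows. This is an illustration under stated assumptions rather than any published procedure: a "pattern" is simplified to a word bigram, documents are held in a dict keyed by a document identifier, and both function names are hypothetical.

```python
from collections import defaultdict


def patterns_meeting_threshold(docs, min_docs=2):
    """Return the bigrams attested in at least `min_docs` separate documents.

    docs: dict mapping a document id to raw text (an assumed layout).
    Repetition inside a single document does not count towards the threshold.
    """
    doc_ids_by_pattern = defaultdict(set)
    for doc_id, text in docs.items():
        tokens = text.lower().split()  # naive tokenization, illustration only
        for bigram in zip(tokens, tokens[1:]):
            doc_ids_by_pattern[bigram].add(doc_id)
    return {p for p, ids in doc_ids_by_pattern.items() if len(ids) >= min_docs}


def every_nth(hits, n):
    """Keep the nth, 2nth, 3nth ... item of a list of concordance lines."""
    return hits[n - 1::n]
```

Both functions discard attested data by design: a bigram repeated many times in a single document is dropped by the first, and the second keeps only X // n of X concordance lines, which is the reduction the paragraph above questions.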
The corpus-driven linguists have strong objections to corpus annotation. This is closely associated with the second difference between the two approaches: their different attitudes towards existing theories and intuitions. It is claimed that the corpus-driven linguists come to a corpus with no preconceived theory, with the aim of postulating linguistic categories entirely on the basis of corpus data, though corpus-driven linguists do concede that pre-corpus theories are insights accumulated over centuries which should not be discarded readily and that intuitions are essential in analyzing data. This claim is a little surprising, as traditional categories such as nouns, verbs, prepositions, subjects, objects, clauses, and passives are not uncommon in so-called corpus-driven studies. When these terms occur they are used without a definition and are accepted as given. Also, linguistic intuitions typically come as a result of accumulated education in preconceived theory. So applying intuitions when classifying concordances may simply be an implicit annotation process, which unconsciously makes use of preconceived theory. As implicit annotation is not open to scrutiny, it is to all intents and purposes unrecoverable and thus more unreliable than explicit annotation. Corpus-based linguists do not have such a hostile attitude toward existing theory. The corpus-based approach typically has existing theory as a starting point and corrects and revises such theory in the light of corpus evidence. As part of this process, corpus annotation is common. Annotating a corpus, most notably part of speech tagging, inevitably involves developing a tagset on the basis of an existing theory, which is then tested and revised constantly to mirror the attested language use. Apart from the usefulness of corpus annotation as a product, which greatly facilitates corpus exploration, annotation as a process is also important.
As Aarts (2002, 122) observes, as part of the annotation process the task of the linguist becomes “to examine where the annotation fits the data and where it does not, and to make changes in the description and annotation scheme where it does not.” The claimed independence of preconception on the part of corpus-driven linguists is clearly an overstatement. A truly corpus-driven approach, if defined in this way, would require an analyst who has never received any education related to language use and is therefore free from preconceived theory, for as Sampson (2001, 135) observes, schooling plays an important role in forming one’s intuitions. Given that preconceived theory is difficult to totally reject and dismiss, and intuitions are indeed called upon in corpus-driven linguistics, we cannot see any real difference between the corpus-driven demand to re-examine pre-corpus theories in the new framework and corpus-based linguists’ practice of testing and revising such theories. Furthermore, if the so-called proven corpus-driven categories in corpus-driven linguistics, which are supposed to be already fully consistent with and directly reflect corpus evidence, also need refinement in the light of different corpus data, the original corpus data is arguably not representative enough. The endless refinement will result in inconsistent language descriptions which will place an unwelcome burden on the learner. In this sense, the corpus-driven approach is no better than the corpus-based approach.
The third important difference between the corpus-driven and corpus-based approaches is their different research focuses. As the corpus-driven approach makes no distinction between lexis, syntax, pragmatics, semantics and discourse (because all of these are pre-corpus concepts and they combine to create meaning), the holistic approach provides, unsurprisingly, only one level of language description, namely, the functionally complete unit of meaning or language patterning. In studying patterning, corpus-driven linguists concede that while collocation can be easily identified in KWIC concordances of unannotated data, colligation is less obvious unless a corpus is grammatically tagged. Yet a tagged corpus is the last thing the corpus-driven linguists should turn to, as grammatical tagging is based on preconceived theory, and consequently results in a loss of information, in their view. To overcome this problem, Firth’s definition of colligation is often applied in a loose sense (in spite of the claim that corpus-driven linguistics is deeply rooted in Firth’s work), because studying colligation in Firth’s original sense necessitates a tagged or even a parsed corpus. According to Firth (1968, 181), colligation refers to the relations between words at the grammatical level, i.e. the relations of “word and sentence classes or of similar categories” instead of “between words as such.” But nowadays the term colligation has been used to refer not only to significant co-occurrence of a word with grammatical classes or categories (e.g. Hoey 1997; 2000; Stubbs 2001c, 112) but also to significant co-occurrence of a word with grammatical words (e.g. Krishnamurthy 2000). The patterning with grammatical words, of course, can be observed and computed even using a raw corpus.
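To make the last point concrete, the sketch below builds KWIC (key word in context) windows from an untagged token list and counts the words co-occurring with a node word. It is a minimal illustration under stated assumptions: the function names kwic and collocates are hypothetical, tokenization is left to the caller, and raw co-occurrence counts stand in for the significance measures a real study would apply.

```python
from collections import Counter


def kwic(tokens, node, span=4):
    """Yield (left context, node, right context) for each occurrence of node."""
    for i, tok in enumerate(tokens):
        if tok == node:
            yield tokens[max(0, i - span):i], tok, tokens[i + 1:i + 1 + span]


def collocates(tokens, node, span=4, only=None):
    """Count words within `span` tokens of `node` in a raw (untagged) corpus.

    only: optional set of words to keep, e.g. a hand-made list of
    grammatical words (an assumption for illustration, not a standard list).
    """
    counts = Counter()
    for left, _, right in kwic(tokens, node, span):
        for word in left + right:
            if only is None or word in only:
                counts[word] += 1
    return counts
```

Passing a small set such as {"in", "the", "of"} as only restricts the count to grammatical words, which corresponds to colligation in the looser sense described above. Nothing here has access to word classes, so colligation in Firth’s original sense remains out of reach without tagging.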
A final contrast one can note between corpus-based and corpus-driven approaches is that the corpus-based approach is not as ambitious as the corpus-driven approach. The corpus-driven approach claims to be a paradigm within which a whole language can be described. No such claim is entailed in the corpus-based approach. Yet the corpus-based approach, as a methodology which makes use of corpus data and intuition, has been applied in nearly all branches of linguistics.
The above discussion shows that the sharp distinction forced between the corpus-based and corpus-driven approaches to linguistics is in reality fuzzy. In the remainder of the chapter, we will present a case study of aspect, which seeks to achieve a marriage between theory-driven and corpus-based approaches to linguistics.

58.3. Using corpora to inform aspect theory
<Truncated>

References cited in the above two sections
Aarts, J. (2002), Review of Corpus Linguistics at Work. In: International Journal of Corpus Linguistics 7(1), 118-123.
Abney, S. (1996), Statistical methods and linguistics. In: Klavans, J. & Resnik, P. (eds) The Balancing Act: Combining Symbolic and Statistical Approaches to Language. Cambridge, MA: MIT Press, 1-26.
Andor, J. (2004), The master and his performance: An interview with Noam Chomsky. In: Intercultural Pragmatics 1(1), 93-111.
Beaugrande, R. (1997), Theory and practice in applied linguistics: Disconnection, conflict, or dialectic? In: Applied Linguistics 18(3), 279-313.
Chomsky, N. (1965), Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
Chomsky, N. (1962), Explanatory models in linguistics. In: Nagel, E., Suppes, P. & Tarski, A. (eds) Logic, Methodology, and Philosophy of Science. Stanford: Stanford University Press, 528-550.
Chomsky, N. (1986), Knowledge of Language. New York: Praeger.
Fillmore, C. (1992), "Corpus linguistics" or "Computer-aided armchair linguistics". In: Svartvik, J. (ed) Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991. Berlin, New York: Mouton de Gruyter, 35-60.
Firth, J. (1968), A synopsis of linguistic theory. In: Palmer, F. (ed) Selected Papers of J.R. Firth 1952-59. London: Longman, 168-205.
Francis, G./Hunston, S./Manning, E. (1996), Collins Cobuild Grammar Patterns 1: Verbs. London: HarperCollins.
Francis, G./Hunston, S./Manning, E. (1998), Collins Cobuild Grammar Patterns 2: Nouns and Adjectives. London: HarperCollins.
Hoey, M. (1997), From concordance to text structure: new uses for computer corpora. In: Melia, J. & Lewandowska, B. (eds) PALC ’97: Proceedings of Practical Applications of Linguistic Corpora Conference. Lodz: University of Lodz, 2-23.
Hoey, M. (2000), A world beyond collocation: new perspectives on vocabulary teaching. In: Lewis, M. (ed) Teaching Collocations. Hove: Language Teaching Publications, 224-245.
Keller, F. (2000), Gradience in Grammar. PhD thesis. University of Edinburgh.
Krishnamurthy, R. (2000), Collocation: from silly ass to lexical sets. In: Heffer, C., Sauntson, H. & Fox, G. (eds) Words in Context: A tribute to John Sinclair on his Retirement. Birmingham: University of Birmingham.
Leech, G. (1991), The state of the art in corpus linguistics. In: Aijmer, K. & Altenberg, B. (eds) English Corpus Linguistics. London: Longman, 8-29.
Leech, G. (1992), Corpora and theories of linguistic performance. In: Svartvik, J. (ed) Directions in Corpus Linguistics: Proceedings of Nobel Symposium 82, Stockholm, 4-8 August 1991. Berlin: Mouton de Gruyter, 105-122.
Leech, G. (2000), Grammar of spoken English: new outcomes of corpus-oriented research. In: Language Learning 50(4), 675-724.
Louw, W. (1991), Classroom concordancing of delexical forms and the case for integrating language and literature. In: Johns, T. & King, P. (eds) Classroom Concordancing, ELR Journal 4. CELS University of Birmingham, 151-178.
McEnery, A./Wilson, A. (2001), Corpus Linguistics (1st ed. 1996). Edinburgh: Edinburgh University Press.
McEnery, A./Xiao, Z./Tono, Y. (2005), Corpus-based Language Studies: An advanced resource book. London: Routledge.
Meyer, C. (2002), English Corpus Linguistics: An introduction. Cambridge: Cambridge University Press.
Meyer, C./Nelson, G. (2005), Data collection. In: Aarts, B. & McMahon, A. (eds) The Handbook of English Linguistics. Oxford: Blackwell.
Murison-Bowie, S. (1996), Linguistic corpora and language teaching. In: Annual Review of Applied Linguistics 16, 182-199.
Newmeyer, F. (2003), Grammar is grammar and usage is usage. In: Language 79(4), 682-707.
Radford, A. (1997), Syntax: A Minimalist Introduction. Cambridge: Cambridge University Press.
Sampson, G. (2001), Empirical Linguistics. London: Continuum.
Saussure, F. (1916/1966), Course in General Linguistics. New York: McGraw-Hill.
Schütze, C. (1996), The Empirical Base of Linguistics. Chicago: University of Chicago Press.
Seuren, P. (1998), Western Linguistics: A Historical Introduction. Oxford: Blackwell.
Shei, C. (2004), Corpus and grammar: What it isn't. In: Li, I. (ed) Concentric Studies in Linguistics. Taipei: National Taiwan Normal University.
Stubbs, M. (2001), Words and Phrases. Oxford: Blackwell.
Tognini-Bonelli, E. (2001), Corpus Linguistics at Work. Amsterdam: John Benjamins.
Widdowson, H. (2000), The limitations of linguistics applied. In: Applied Linguistics 21(1), 3-25.
 