In reply to No. 10 -
1) form vs. word form
For the sake of clarity, it is desirable to maintain a distinction between "form" (as opposed to meaning or function) and "word form" (as opposed to lemma);
2) lemmatisation, POS tagging and parsing
These are three types of corpus annotation; there are of course many other types. While POS tagging and parsing are closely associated, lemmatisation is clearly a distinct annotation type: it is not rooted in POS tagging or parsing. The three are not to be conflated.
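The independence of the two annotation layers can be sketched as follows (a minimal illustration using toy lookup tables, not a real lemmatiser or tagger; the word lists and tag labels are hypothetical):

```python
# Toy lookup tables standing in for a lemmatiser and a POS tagger.
# The point: lemma and POS tag are separate annotation layers
# attached independently to the same word form.
LEMMAS = {"mice": "mouse", "ran": "run", "better": "good"}
POS_TAGS = {"mice": "NNS", "ran": "VBD", "better": "JJR"}

def annotate(tokens):
    """Return (word form, lemma, POS tag) triples for each token."""
    return [(w, LEMMAS.get(w, w), POS_TAGS.get(w, "NN")) for w in tokens]

print(annotate(["mice", "ran"]))
# each triple carries both annotation types, but neither is derived
# from the other
```

Here the lemma column could be produced without the POS column and vice versa, which is the sense in which lemmatisation is not "rooted in" POS tagging.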
3) form, pattern, patterning, "functionally complete unit of meaning"
"Form" is conventionally associated with "function", but it has also been used in relation to "meaning". Hunstun and Partington et al refer to it as pattern or patterning, while sinclair and Tognini-Bonelli et al call it "functionally complete unit of meaning".
4) plain text vs. corpus annotation
There are a number of criticisms of corpus annotation, notably by so-called "corpus-driven" linguists. The plain-text argument is one of them. Below is a section that deals with these issues.
Corpus annotation = added value
Like corpus markup, annotation adds value to a corpus. Leech (1997a: 2) maintains that corpus annotation is ‘a crucial contribution to the benefit a corpus brings, since it enriches the corpus as a source of linguistic information for future research and development.’ Both Leech (ibid: 4-5) and McEnery (2003: 454-455) suggest that there are at least four advantages of corpus annotation.
Firstly, it is much easier to extract information from annotated corpora in a number of ways. Leech (ibid) observes, for example, that without part-of-speech tagging (see unit 4.4.1), it is difficult to extract left as an adjective from a raw corpus, as its various meanings and uses cannot be identified from its orthographic form or context alone. For example, the orthographic form left with a meaning opposite to right can be an adjective, an adverb or a noun. It can also be the past or past participle form of leave. With appropriate part-of-speech annotations these different uses of left can be readily distinguished. Corpus annotation also enables human analysts and machines to exploit and retrieve analyses of which they are not themselves capable (McEnery 2003: 454). For example, even if you do not know Chinese, given a suitably annotated Chinese corpus, you are able to find out a great deal about Chinese using that corpus (see case study 6 in Section C). Speed of data extraction is another advantage of annotated corpora. Even if one is capable of undertaking the required linguistic analyses, one is quite unlikely to be able to explore a raw corpus as swiftly and reliably as one can explore an annotated corpus if one has to start by annotating the corpus oneself.
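The point about extracting left as an adjective can be sketched in a few lines. The word_TAG format and the CLAWS-style tag labels below are illustrative (the sentence is invented for the example):

```python
# A tiny "tagged corpus": each token carries a word_TAG annotation
# (VVD = past tense verb, JJ = adjective, NN1 = singular noun,
# RL = locative adverb, in the style of the CLAWS tagset).
tagged = ("He left_VVD the room . Turn at the left_JJ corner . "
          "She sat on my left_NN1 . Look left_RL now .").split()

# With the tags in place, pulling out only the adjectival use of
# "left" is a simple filter; on raw text this would be impossible
# without re-analysing every occurrence.
adjectives = [t for t in tagged if t.lower().startswith("left_jj")]
print(adjectives)
```

On plain text, all four occurrences of left are orthographically identical; the annotation is what makes the one-line query possible.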
Secondly, an annotated corpus is a reusable resource, as annotation records linguistic analyses within the corpus that are then available for reuse. Considering that corpus annotation tends to be costly and time consuming, reusability is a powerful argument in favour of corpus annotation (cf. Leech 1997a: 5).
Thirdly, an advantage of corpus annotation, related to reusability, is multi-functionality. A corpus may have originally been annotated with one specific purpose in mind. However, corpus analyses may be reused for a variety of applications and even for purposes not originally envisaged.
Finally, corpus annotation records a linguistic analysis explicitly. As such, the corpus annotation stands as a clear and objective record of analysis that is open to scrutiny and criticism (cf. McEnery 2003), a laudable goal.
In addition to these advantages we can also note that corpus annotation, like a corpus per se, provides a standard reference resource. While a corpus may constitute a standard reference for the language variety which it is supposed to represent, corpus annotation provides a stable base of linguistic analyses, objectively recorded, so that successive studies can be compared and contrasted on a common basis.
Having outlined the advantages of corpus annotation, it is necessary to address some of the criticisms of corpus annotation. Four main criticisms of corpus annotation have been presented over the past decade.
The first criticism is that corpus annotation produces cluttered corpora. Hunston (2002: 94) argues that ‘[h]owever much annotation is added to a text, it is important for the researcher to be able to see the plain text, uncluttered by annotational labels.’ While we agree that the plain text is important in a corpus analysis, especially in observing the patterning of words, corpus annotation does not necessarily obscure such patterning, because most corpus exploration tools (e.g. WordSmith, MonoConc, SARA and Xaira, see Section C) do indeed allow users to suppress annotation in search results so as to view the plain text. As such this criticism is directed more at corpus browsing/retrieval tools than at corpus annotation per se.
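The kind of annotation suppression such tools offer can be sketched with a simple substitution (the word_TAG format here is illustrative, not the actual internals of any of the tools named above):

```python
import re

# An annotated line in a hypothetical word_TAG format.
annotated = "Turn_VV0 at_II the_AT left_JJ corner_NN1"

# Stripping the tag suffixes recovers the uncluttered plain text,
# so annotation and plain-text viewing are not mutually exclusive.
plain = re.sub(r"_[A-Z0-9]+", "", annotated)
print(plain)  # Turn at the left corner
```

The annotation is still there in the stored corpus; it is simply hidden at display time, which is the thrust of the reply to Hunston's criticism.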
A second criticism is that annotation imposes a linguistic analysis upon a corpus user. While it is true that corpus annotation is fundamentally interpretative in nature, there is no compulsion for corpus users to accept that analysis. They can impose their own interpretations if they wish, or simply ignore the annotation. The plurality of interpretations of a text is something that must be accepted from the outset when undertaking corpus annotation (cf. McEnery 2003: 456). Yet leaving a corpus unannotated does not mean that no process of interpretation occurs when the corpus is analyzed. Rather, the lack of annotation simply disguises the fact that such multiple interpretations still occur when researchers use a raw corpus. The analysis still happens; it is simply hidden from clear view. Corpus annotation should be recognized as an advantage rather than a weakness in this respect, as it provides an objective record of an explicit analysis open to scrutiny; failing to annotate is not simply a failure to analyze. Failing to annotate does, however, ensure that the analysis is difficult, or indeed impossible, to recover.
A further criticism is that annotation may ‘overvalue’ a corpus, making it less readily accessible, updateable and expandable (cf. Hunston 2002: 92-93). Annotation does not necessarily make a corpus less accessible. For example, many parsed corpora (e.g. the Lancaster Parsed Corpus and the Susanne corpus, see unit 7.4) and prosodically annotated corpora (e.g. the London-Lund Corpus and the Lancaster/IBM Spoken English Corpus, see unit 7.5) are publicly available. Corpus builders are usually happy to make their corpora available as widely as possible in spite of (or sometimes because of) the huge effort that they have put into annotation. Funders are also often prepared to finance corpus construction because a valuable annotated resource will be made widely available. Public funding bodies are particularly unlikely to fund corpus building projects which do not result in a readily accessible resource. A more common reason for not making an annotated corpus (or indeed a raw corpus) publicly available is that the copyright issues related to the corpus data prohibit it (see unit 9). Copyright, not annotation, is the greater force in favour of restriction. The arguments relating to updating and expansion are also questionable. Unlike a monitor corpus, which is constantly updated to track rapid language change (see unit 7.9 for further discussion), most corpora are sample corpora. A sample corpus is designed to represent a particular language variety at a particular time. For example, the LOB and Brown corpora are supposed to represent written British and American English in the early 1960s. There are indeed ‘updates’ for the two corpora, FLOB and Frown (see unit 7.4). The two updated corpora respectively represent written British and American English in the early 1990s and can be used to track slower-paced language change (see unit 15.5). The need for constant expansion is only related to the dynamic monitor corpus model.
It does not necessarily apply as an argument to sample corpora. Given that most corpora are sample corpora, the expandability argument is hardly important, as with a sample corpus size is typically determined when the corpus is designed. Once the corpus is created, there is generally no need for expansion.
The final criticism is related to the accuracy and consistency of corpus annotation. There are three basic methods of annotating a corpus: automatic, computer-assisted and manual (see unit 4.3). On the one hand, as Hunston (2002: 91) argues, ‘an automatic annotation program is unlikely to produce results that are 100% in accordance with what a human researcher would produce; in other words, there are likely to be errors.’ Such errors also occur when humans alone analyze the texts; even the best linguist at times makes mistakes. Introducing a human factor into annotation may have another implication: Sinclair (1992) argues that the introduction of a human element in corpus annotation, as in manual or computer-assisted annotation, results in a decline in the consistency of annotation. Taking the two points together, one might wonder why any linguist has ever carried out an analysis, as it would have been inaccurate and inconsistent! One must conclude that they have done so, and that annotators continue to do so, because while inconsistency and inaccuracy in analyses are indeed observable phenomena, their impact upon an expert human analysis has been exaggerated. Also, the computer is not a sure-fire means of avoiding inaccuracy or inconsistency: the two points may also apply to machine analyses. Automatic annotation is not error-free, and it may be inconsistent. If resources are altered for an annotation program (the lexicon changed, rules rewritten), then over time the output of the program will exhibit inconsistency on a scale that may well exceed that displayed by human analysts. So what should we use for corpus annotation, human analysts or the computer?
Given that the value of corpus annotation is well recognized, the human analyst and the machine should complement each other, providing a balanced approach to accuracy and consistency that seeks to reduce inaccuracy and inconsistency to levels tolerable to the research question that the corpus is intended to investigate.
It is clear from the above discussion that all four criticisms of corpus annotation can be dismissed, with caveats, quite safely. Annotation only means undertaking and making explicit a linguistic analysis. As such, it is something that linguists have been doing for centuries.
5) character-based approach
In terms of type frequency, monosyllabic words do not take a prominent place in Chinese, though some of them are very frequently used (see the token frequencies below). The following statistics are based on the 1-million-word LCMC corpus (written) and the 0.92-million-word Lancaster Los Angeles Spoken Chinese Corpus:
1-gram: type: 3941; token: 1011726
2-gram: type: 32449; token: 624330
3-gram: type: 8451; token: 42359
4-gram: type: 5766; token: 17101
4+-gram: type: 1710; token: 4593
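Working the figures above through makes the type/token contrast explicit (the totals below are computed directly from the five rows just given):

```python
# n-gram length -> (types, tokens), from the LCMC + spoken corpus
# figures quoted above.
counts = {
    "1": (3941, 1011726),
    "2": (32449, 624330),
    "3": (8451, 42359),
    "4": (5766, 17101),
    "4+": (1710, 4593),
}

total_types = sum(t for t, _ in counts.values())
total_tokens = sum(tok for _, tok in counts.values())

# Monosyllabic (1-character) words: small share of types,
# majority share of tokens.
print(f"1-char share of types:  {counts['1'][0] / total_types:.1%}")
print(f"1-char share of tokens: {counts['1'][1] / total_tokens:.1%}")
```

Monosyllabic words thus make up only about 7.5% of word types but about 59.5% of word tokens, which is exactly the contrast drawn in the text.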
It is clear that the character-based approach cannot solve much of the problem. The word-based approach, in contrast, also treats monosyllabic items as words.