http://www.bossenglish.com/ReadNews.asp?NewsID=440
What is a Corpus, and what is in it?
The concept of carrying out research on written or spoken texts is not restricted to corpus linguistics. Indeed, individual texts are often used for many kinds of literary and linguistic analysis - the stylistic analysis of a poem, or a conversation analysis of a TV talk show. However, the notion of a corpus as the basis for a form of empirical linguistics differs from the examination of single texts in several fundamental ways.
In principle, any collection of more than one text can be called a corpus (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus", when used in the context of modern linguistics, tends most frequently to have more specific connotations than this simple definition. The following list describes the four main characteristics of the modern corpus.
Sampling and Representativeness
Often in linguistics we are not merely interested in an individual text or author, but a whole variety of language. In such cases we have two options for data collection:
We could analyse every single utterance in that variety - however, this option is impracticable except in a few cases, for example with a dead language which only has a few surviving texts. In most cases, analysing every utterance would be an unending and impossible task.
We could construct a smaller sample of that variety. This is a more realistic option.
As discussed in lecture 1, one of Chomsky's criticisms of the corpus approach was that language is infinite - therefore, any corpus would be skewed. In other words, some utterances would be excluded because they are rare, others which are much more common might be excluded by chance, and alternatively, extremely rare utterances might also be included several times. Although nowadays modern computer technology allows us to collect much larger corpora than those that Chomsky was thinking about, his criticisms still must be taken seriously. This does not mean that we should abandon corpus linguistics, but instead try to establish ways in which a less biased, more representative corpus may be constructed.
We are therefore interested in creating a corpus which is maximally representative of the variety under examination, that is, which provides us with as accurate a picture as possible of the tendencies of that variety, as well as their proportions. What we are looking for is a broad range of authors and genres which, when taken together, may be considered to "average out" and provide a reasonably accurate picture of the entire language population in which we are interested.
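The idea of sampling a broad range of genres in their population proportions can be sketched as proportional stratified sampling. The sketch below is illustrative only: the genres, text counts and sample size are invented, not drawn from any real corpus-building project.

```python
import random

# Hypothetical text population: (genre, text_id) pairs. The genres and
# their counts are invented purely for illustration.
population = (
    [("press", f"press_{i}") for i in range(500)]
    + [("fiction", f"fic_{i}") for i in range(300)]
    + [("academic", f"acad_{i}") for i in range(200)]
)

def stratified_sample(texts, sample_size, seed=0):
    """Sample texts so that each genre keeps its share of the population."""
    rng = random.Random(seed)
    by_genre = {}
    for genre, text_id in texts:
        by_genre.setdefault(genre, []).append(text_id)
    total = len(texts)
    sample = []
    for genre, ids in by_genre.items():
        # Proportional allocation: each genre contributes in proportion
        # to its size in the whole population.
        k = round(sample_size * len(ids) / total)
        sample.extend((genre, t) for t in rng.sample(ids, k))
    return sample

corpus_sample = stratified_sample(population, 100)
```

A 100-text sample drawn this way preserves the 5:3:2 genre proportions of the population, which is one simple way of letting the sample "average out" over genres.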
Finite Size
The term "corpus" also implies a body of text of finite size, for example, 1,000,000 words. This is not universally so - for example, at Birmingham University, John Sinclair's COBUILD team have been engaged in the construction and analysis of a monitor corpus. This "collection of texts", as Sinclair's team prefer to call it, is an open-ended entity - texts are constantly being added to it, so it gets bigger and bigger. Monitor corpora are of interest to lexicographers, who can trawl a stream of new texts looking for the occurrence of new words, or for changing meanings of old words. Their main advantages are:
They are not static - new texts can always be added, unlike the synchronic "snapshot" provided by finite corpora.
Their scope - they provide for a large and broad sample of language.
Their main disadvantage is:
They are not such a reliable source of quantitative data (as opposed to qualitative data) because they are constantly changing in size and are less rigorously sampled than finite corpora.
With the exception of monitor corpora, it should be noted that it is more often the case that a corpus consists of a finite number of words. Usually this figure is determined at the beginning of a corpus-building project. For example, the Brown Corpus contains 1,000,000 running words of text. Unlike the monitor corpus, when a corpus reaches its grand total of words, collection stops and the corpus is not increased in size. (An exception is the London-Lund corpus, which was increased in the mid-1970s to cover a wider variety of genres.)
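The lexicographer's use of a monitor corpus described above - trawling a stream of incoming texts for previously unseen word forms - can be sketched in a few lines. The lexicon and the incoming batch here are toy examples, not real COBUILD data.

```python
def new_words(known_vocabulary, incoming_text):
    """Return word forms in the incoming text not yet in the lexicon."""
    tokens = incoming_text.lower().split()
    return sorted(set(tokens) - known_vocabulary)

# A toy lexicon and a toy incoming batch of text.
lexicon = {"the", "corpus", "grows", "every", "month"}
batch = "the corpus grows and grows every month with blogposts"

unseen = new_words(lexicon, batch)
lexicon.update(unseen)  # the monitor lexicon, like the corpus, is open-ended
```

Each new batch of texts is checked against the accumulated vocabulary, and the lexicon grows along with the corpus - the open-endedness that distinguishes a monitor corpus from a finite one.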
Machine-readable form
Nowadays the term "corpus" nearly always implies the additional feature "machine-readable". This was not always the case as in the past the word "corpus" was only used in reference to printed text.
Today few corpora are available in book form - one which does exist in this way is "A Corpus of English Conversation" (Svartvik and Quirk 1980) which represents the "original" London-Lund corpus. Corpus data (not excluding context-free frequency lists) is occasionally available in other forms of media. For example, a complete key-word-in-context concordance of the LOB corpus is available on microfiche, and with spoken corpora copies of the actual recordings are sometimes available - this is the case with the Lancaster/IBM Spoken English Corpus but not with the London-Lund corpus.
Machine-readable corpora possess the following advantages over written or spoken formats:
They can be searched and manipulated at speed. (This is something which we covered at the end of Part One).
They can easily be enriched with extra information. (We will examine this in detail later.)
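As an illustration of the first advantage, a key-word-in-context (KWIC) concordance - the kind of display mentioned above for the LOB corpus - is trivial to compute over a machine-readable text. This is a minimal sketch; real concordancers handle punctuation, sorting and much larger texts.

```python
def kwic(tokens, keyword, context=3):
    """Return key-word-in-context lines for every occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            lines.append(f"{left:>30}  [{tok}]  {right}")
    return lines

text = "the book is on the table and the table is old".split()
for line in kwic(text, "table"):
    print(line)
```

Producing such a concordance from a printed text would mean scanning every page by eye; from a machine-readable text it is a single pass over the tokens.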
A standard reference
There is often a tacit understanding that a corpus constitutes a standard reference for the language variety that it represents. This presupposes that it will be widely available to other researchers, which is indeed the case with many corpora - e.g. the Brown Corpus, the LOB corpus and the London-Lund corpus.
One advantage of a widely available corpus is that it provides a yardstick by which successive studies can be measured. So long as the methodology is made clear, new results on related topics can be directly compared with already published results without the need for re-computation.
A standard corpus also means that a continuous base of data is being used. This implies that any variation between studies is less likely to be attributable to differences in the data and more to the adequacy of the assumptions and methodology contained in the study.
Text Encoding and Annotation
If a corpus is said to be unannotated, it appears in its existing raw state of plain text, whereas an annotated corpus has been enhanced with various types of linguistic information. Unsurprisingly, the utility of the corpus is increased when it has been annotated: it is no longer simply a body of text in which linguistic information is implicitly present, but one which may be considered a repository of linguistic information. The implicit information has been made explicit through the process of annotation.
For example, the form "gives" contains the implicit part-of-speech information "third person singular present tense verb" but it is only retrieved in normal reading by recourse to our pre-existing knowledge of the grammar of English. However, in an annotated corpus the form "gives" might appear as "gives_VVZ", with the code VVZ indicating that it is a third person singular present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker and easier to retrieve and analyse information about the language contained in the corpus.
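The "word_TAG" convention makes such retrieval straightforward in practice. The sketch below splits tagged tokens of the form shown above into (word, tag) pairs and pulls out the lexical verbs; the sentence and the NP1/NN1 tags are illustrative examples in the same style as the VVZ tag in the text.

```python
def split_tagged(tagged_text, sep="_"):
    """Split tokens like 'gives_VVZ' into (word, tag) pairs."""
    pairs = []
    for token in tagged_text.split():
        # rpartition guards against words that themselves contain the separator
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs

pairs = split_tagged("John_NP1 gives_VVZ money_NN1")
verbs = [w for w, t in pairs if t.startswith("VV")]
```

A query like "all lexical verbs" thus becomes a simple filter on the tag, rather than a judgement a human reader must make word by word.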
Leech (1993) describes 7 maxims which should apply in the annotation of text corpora.
Formats of Annotation
Currently, there are no widely agreed standards of representing information in texts, and in the past many different approaches have been adopted, some more lasting than others. One long-standing annotation practice is known as COCOA references. COCOA was an early computer program used for extracting indexes of words in context from machine-readable texts. Its conventions were carried forward into several other programs, notably the OCP (Oxford Concordance Program). The Longman-Lancaster corpus and the Helsinki corpus have also used COCOA references.
Very simply, a COCOA reference consists of a balanced set of angled brackets (< >) which contains two entities:
A code which stands for a particular variable name.
A string or set of strings, which are the instantiations of that variable.
For example, the code "A" could be used to refer to the variable "author" and the string would stand for the author's name. Thus COCOA references which indicate the author of a passage of text would look like the following:
<A CHARLES DICKENS>
<A WOLFGANG VON GOETHE>
<A HOMER>
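Because a COCOA reference is just a variable code and its value between angled brackets, extracting them from a text is a simple pattern match. This is a minimal sketch assuming single-letter codes like the "A" example above; actual COCOA conventions allowed other code shapes.

```python
import re

# One uppercase code letter, whitespace, then the value, inside < >.
COCOA = re.compile(r"<([A-Z])\s+([^>]+)>")

def parse_cocoa(text):
    """Extract (code, value) pairs from COCOA-style references."""
    return COCOA.findall(text)

refs = parse_cocoa("<A CHARLES DICKENS> It was the best of times...")
```

Running the same pattern over a whole corpus file yields an index of authors, dates, titles or whatever other variables the corpus builders chose to encode.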
COCOA references only represent an informal trend for encoding specific types of textual information, e.g. authors, dates and titles. Current trends are moving towards more formalised international standards of encoding. The flagship of this trend is the Text Encoding Initiative (TEI), a project sponsored by the Association for Computational Linguistics, the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities. Its aim is to provide standardised implementations for machine-readable text interchange.
The TEI uses a form of document markup known as SGML (Standard Generalised Markup Language). SGML has the following advantages:
Clarity
Simplicity
Formal rigour
Already recognised as an international standard
The TEI's contribution is a detailed set of guidelines as to how this standard is to be used in text encoding (Sperberg-McQueen and Burnard, 1994).
In the TEI, each text (or document) consists of two parts - a header and the text itself. The header contains information such as the following:
author, title and date
the edition or publisher used in creating the machine-readable text
information about the encoding practices adopted.
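To make the header idea concrete, here is a toy header fragment and the code to read fields out of it. The element names follow the TEI header (teiHeader, fileDesc, titleStmt), but note the assumptions: the fragment is heavily simplified, not a valid TEI document, and it is written as XML, whereas the guidelines of the period targeted SGML.

```python
import xml.etree.ElementTree as ET

# A minimal, simplified TEI-style header fragment (illustrative only).
header = """
<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>A Sample Text</title>
      <author>Charles Dickens</author>
    </titleStmt>
  </fileDesc>
</teiHeader>
"""

root = ET.fromstring(header)
title = root.findtext(".//title")    # descend to the title element
author = root.findtext(".//author")  # and the author element
```

Because the header is machine-readable markup rather than free prose, a program can pull out the author, title or encoding practices without any human intervention.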
Textual and extra-textual information
The most basic type of additional information is that which tells us what text or texts we are looking at. A computer file name may give us a clue to what the file contains, but in many cases filenames can only provide us with a tiny amount of information.
Information about the nature of the text can often consist of much more than a title and an author.
These information fields provide the document with a whole document header which can be used by retrieval programs to search and sort on particular variables. For example, we might only be interested in looking at texts in a corpus that were written by women, so we could ask a computer program to retrieve texts where the author's gender variable is equal to "FEMALE".
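The kind of retrieval just described - selecting texts whose header fields match given values - can be sketched as a filter over a catalogue of headers. The catalogue below, including its field names and values, is entirely invented for illustration.

```python
# Hypothetical corpus catalogue: one metadata dictionary per text.
catalogue = [
    {"id": "t1", "author": "A. Smith", "gender": "FEMALE", "genre": "press"},
    {"id": "t2", "author": "B. Jones", "gender": "MALE", "genre": "fiction"},
    {"id": "t3", "author": "C. Brown", "gender": "FEMALE", "genre": "fiction"},
]

def select(texts, **criteria):
    """Return texts whose header matches every given field=value pair."""
    return [t for t in texts if all(t.get(k) == v for k, v in criteria.items())]

women = select(catalogue, gender="FEMALE")
women_fiction = select(catalogue, gender="FEMALE", genre="fiction")
```

Criteria can be combined freely, so the same mechanism answers "texts by women" and "fiction texts by women" alike.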
Orthography
It might be thought that converting a written or spoken text into machine-readable form is a relatively simple typing or optical-scanning task, but even with a basic machine-readable text, issues of encoding are vital, although to English speakers their extent may not be apparent at first.
In languages other than English, the issue of accents and of non-Roman alphabets such as Greek, Russian and Japanese presents a problem. IBM-compatible computers are capable of handling accented characters, but many other mainframe computers are unable to do this. Therefore, for maximum interchangeability, accented characters need to be encoded in other ways. Various strategies have been adopted by native speakers of languages which contain accents when using computers or typewriters which lack these characters. For example, French speakers omit the accent entirely, writing Hélène as Helene. To handle the umlaut, German speakers either introduce an extra letter "e" or place a double quote mark before the relevant letter, so Frühling would become Fruehling or Fr"uhling. However, these strategies cause additional problems - in the case of the French, information is lost, while in the German, extraneous information is added.
In response to this the TEI has suggested that these characters be encoded as TEI entities, using the delimiting characters & and ;. Thus, ü would be encoded by the TEI as &uuml;.
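The entity scheme can be sketched as a simple character-for-string substitution. The entity names below (uuml, eacute, egrave) are standard SGML entity names; the tiny mapping table is of course only a fragment of the full set, and a real encoder would cover far more characters.

```python
def sgml_entity(char):
    """Encode an accented character as an SGML/TEI-style entity."""
    entities = {"ü": "&uuml;", "é": "&eacute;", "è": "&egrave;"}
    return entities.get(char, char)  # pass unaccented characters through

def encode(text):
    return "".join(sgml_entity(c) for c in text)
```

Unlike the omission and e-insertion strategies, this encoding is lossless and unambiguous: the entity can always be decoded back to exactly the original character.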
Types of annotation
Certain kinds of linguistic annotation, which involve the attachment of special codes to words in order to indicate particular features, are often known as "tagging" rather than annotation, and the codes which are assigned to features are known as "tags". These terms will be used in the sections which follow:
Part of Speech annotation
Lemmatisation
Parsing
Semantics
Discoursal and text linguistic annotation
Phonetic transcription
Prosody
Multilingual Corpora
Not all corpora are monolingual, and an increasing amount of work is being carried out on the building of multilingual corpora, which contain texts in several different languages.
First we must make a distinction between two types of multilingual corpora. The first can really be described as a small collection of individual monolingual corpora, in the sense that the same procedures and categories are used for each language, but each corpus contains completely different texts. For example, the Aarhus corpus of Danish, French and English contract law consists of a set of three monolingual law corpora; it is not composed of translations of the same texts.
The second type of multilingual corpora (and the one which receives the most attention) is parallel corpora. This refers to corpora which hold the same texts in more than one language. The parallel corpus dates back to mediaeval times when "polyglot bibles" were produced which contained the biblical texts side by side in Hebrew, Latin and Greek etc.
A parallel corpus is not immediately user-friendly. For the corpus to be useful it is necessary to identify which sentences in the sub-corpora are translations of each other, and which words are translations of each other. A corpus which shows these identifications is known as an aligned corpus as it makes an explicit link between the elements which are mutual translations of each other. For example, in a corpus the sentences "Das Buch ist auf dem Tisch" and "The book is on the table" might be aligned to one another. At a further level, specific words might be aligned, e.g. "Das" with "The". This is not always a simple process, however, as often one word in one language might be equal to two words in another language, e.g. the German word "raucht" would be equivalent to "is smoking" in English.
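The sentence and word alignment just described can be sketched as a data structure holding the links explicitly. The alignment links below are hand-made for the example sentence from the text; real aligned corpora induce such links statistically over millions of sentence pairs.

```python
# A toy aligned corpus: each entry pairs a German sentence with its English
# translation and records word links as (German index, English index) pairs.
aligned = [
    {
        "de": "Das Buch ist auf dem Tisch".split(),
        "en": "The book is on the table".split(),
        "links": [(0, 0), (1, 1), (2, 2), (3, 3), (5, 5)],
    },
]

def translations_of(corpus, german_word):
    """Look up English words linked to a German word in the alignment."""
    hits = []
    for pair in corpus:
        for g, e in pair["links"]:
            if pair["de"][g] == german_word:
                hits.append(pair["en"][e])
    return hits
```

Note that "dem" (index 4) has no link at all in this toy alignment: as with "raucht" versus "is smoking", word correspondence across languages is often not one-to-one, which is exactly what makes alignment a non-trivial process.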
At present there are few cases of annotated parallel corpora, and those which exist tend to be bilingual rather than multilingual. However, two EU-funded projects (CRATER and MULTEXT) are aiming to produce genuinely multilingual parallel corpora. The Canadian Hansard corpus is annotated, and contains parallel texts in French and English, but it only covers a restricted range of text types (proceedings of the Canadian Parliament). However, this is an area of growth, and the situation is likely to change dramatically in the near future.
Conclusion
In this section we have -
seen what the term "corpus" entails
learnt about the standards of representing information in texts
learnt about headers and orthography
learnt about the types of annotation a corpus can be given
seen how a corpus can be bilingual or multilingual