Glossary for Corpus-based linguistics


annotation: (linguistic) information, such as POS tags or syntactic parsing, that is added to a text/corpus.
annotate: provide text with annotation.
colligation: collocation patterns based on syntactic groups rather than individual words. (Barnbrook 1996)
collocation: patterns of words appearing together.
collocate: to appear together, or words that appear together. (In the collocations 'apple tree', 'apple pie', and 'Adam's apple', 'apple' collocates with 'tree', 'pie', and 'Adam's'. They are collocates.)
compile: collect and put together (for example, texts for a corpus).
concordance: a word/phrase and its surrounding context. Usually printed as a KWIC display.
context: here usually the words surrounding a hit.
corpus (pl. corpora or corpuses): a collection of text, now usually in machine-readable form and compiled to be representative of a particular kind of language and provided with some kind of annotation.
encoding: annotation.
hit: When your search string is found in the corpus, it is referred to as a hit or match.
KWIC (Key-Word In Context): a form of concordance where the hit is shown with a certain amount of context, often presented with the hit in the centre of the page.
lemma: the set of different forms of a word, such as the inflected forms of a verb. Ex. 'sing', 'sang', 'sung' are one lemma, 'boy', 'boys' another.
lemmatisation: the process or result of dividing a text into lemmas.
mark-up: codes used to provide information about a text, such as POS tags, SGML codes, etc.
match: When your search string is found in the corpus, it is referred to as a match or hit.
natural language: term used for human language, as opposed to artificial languages used for, for example, computer programming and formal logic.
NLP: Natural Language Processing.
parsing: the process or result of making a syntactic analysis.
parser: tool (often an automatic or semi-automatic computer program) used for parsing.
parsed corpus: a corpus that has been syntactically analysed and provided with annotation representing the analysis.
part-of-speech (POS): word class, such as verb, noun, adjective.
part-of-speech tagging: assigning part-of-speech tags to a text.
SGML: (Standard Generalized Mark-up Language) mark-up system used for electronic text.
string: combination of letters/characters.
tag: a label associated with a word (or other unit) providing information about the word, or the process of assigning tags. See annotation. Ex: 'run' can be tagged as a noun (run_N) or verb (run_V).
tagging: the process or result of assigning tags.
tag-set: set of tags used for annotation.
TEI: (Text Encoding Initiative) International project set up "to develop guidelines for the preparation and interchange of electronic texts...". Uses SGML as starting-point.
thin: remove certain hits, either automatically or manually.
thinning: the process or result of removing certain hits, either by selecting the desired ones, selecting the ones to discard or by selecting/discarding a set amount of hits.
token: individual word. Compare type.
tokenisation: the process or result of dividing a text or list of words into tokens.
treebank: term sometimes used for parsed corpora.
type: wordform. "I see a cat and a dog" contains seven tokens but only six types (the type 'a' occurs twice).
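
The token/type distinction can be checked with a few lines of Python. This is only a minimal sketch: it assumes naive whitespace tokenisation and case-sensitive word forms, and the function name `count_tokens_and_types` is made up for illustration.

```python
def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a text."""
    # Assumption: tokens are separated by whitespace; real tokenisers
    # also have to deal with punctuation, clitics, and so on.
    tokens = text.split()
    # A type is a distinct word form, so a set removes the repeats.
    types = set(tokens)
    return len(tokens), len(types)


n_tokens, n_types = count_tokens_and_types("I see a cat and a dog")
print(n_tokens, n_types)  # 7 6
```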


Corpus Linguistics
A study of language that includes all processes related to the processing, usage and analysis of written or spoken machine-readable corpora. Corpus linguistics is a relatively modern term for a methodology based on examples of ‘real life’ language use. At present, the effectiveness and usefulness of corpus linguistics are closely tied to the development of computer science. See McEnery and Wilson 1996; Aarts and Meijs 1990; Leech 1991; Svartvik 1992.

Corpus Processing
A general term used to refer to all processes related to annotation, presentation and analysis of corpora. See Aarts and Meijs 1990; McEnery and Wilson 1996: Ch. 2.


Alignment

A term used to refer to the practice of defining explicit links between texts in a parallel corpus. Alignment links the elements (sentences, phrases or words) that are mutual translations of each other in a parallel corpus. Sentence and word alignment may be performed automatically with a high degree of accuracy; a tool that performs this operation is called an aligner. See McEnery and Oakes 1996; McEnery and Wilson 1996: Ch. 2.


Annotation

A term used to refer to (i) the practice of adding explicit additional information to machine-readable text; (ii) the physical representation of such information. Annotation (or markup) makes it quicker and easier to retrieve and analyse information about the language contained in the corpus. A corpus may be annotated manually, by a single person or by a number of people; alternatively, the annotation may be carried out by a computer program, either completely automatically or semi-automatically (in the latter case the output needs to be post-edited by human beings). Certain kinds of linguistic annotation, which involve the attachment of special codes to words in order to indicate particular features, are frequently known as tagging rather than annotation, and the codes which are assigned are known as tags. See McEnery and Wilson 1996: Ch. 2; Leech 1993; Aarts and Meijs 1990; Brill 1992; Källgren 1996; Leech and Wilson 1994.

anaphoric annotation A form of annotation that refers to the marking of pronoun reference in corpora. Anaphoric annotation can only be carried out by human analysts, since it is one of the aims of the annotation to provide the data on which to train computer programs to carry out this task (see bootstrapping). It is of great importance to NLP since a large amount of the conceptual content of a text is carried by pronouns. See McEnery and Wilson 1996: Ch. 2; Halliday and Hasan 1976; Garside 1993.

discoursal annotation A type of annotation that is used to annotate items whose role in the discourse is primarily to do with discourse management (i.e. politeness, level of formality etc.) rather than with propositional content. Discoursal annotations have never become widely used in corpus linguistics, since identifying such items in texts is a difficult task and a great source of dispute among linguists. See McEnery and Wilson 1996: Ch. 2; Aone and Bennet 1994; Stenström 1984.

ditto tagging, ditto tag A term used to refer to the practice of assigning the same tag to each word in an idiomatic sequence to indicate that they belong to a single phraseological unit. See McEnery and Wilson 1996: Ch. 1; Garside 1987.

part-of-speech tagging The most basic type of linguistic corpus annotation (also called grammatical tagging, morphosyntactic annotation, or part-of-speech annotation); its aim is to assign to each lexical unit in the text a code (or tag) indicating its part-of-speech (e.g. singular common noun - NN, past participle - VBN). Part-of-speech information is a fundamental basis for increasing the specificity of data retrieval from corpora and also forms an essential foundation for further forms of analysis such as syntactic parsing and semantic field annotation. See McEnery and Wilson 1996: Ch. 2; Leech and Wilson 1994; Garside 1987; Brill 1992.

phonetic transcription A form of phonetic annotation that is used to transcribe spoken corpora. Not many examples of publicly available fully phonetically transcribed corpora exist at the present time. Much phonetic annotation exists at the level of prosodic annotation. Phonetic transcription needs to be carried out by human beings rather than computer programs, and moreover by human beings who are well skilled in the perception and transcription of speech sounds. See McEnery and Wilson 1996: Ch. 2.

portmanteau tag A term used to refer to the practice of assigning two tags to some words in order to help the user in cases where there is a strong chance that the computer might otherwise have selected the wrong part-of-speech from the choices available to it. See McEnery and Wilson 1996: Ch. 1.

problem-oriented tagging A particular type of annotation that is used to annotate only the phenomena directly relevant to the research rather than the whole corpus or text (each word, each sentence etc.). It is not exhaustive. Problem-oriented tagging uses an annotation scheme which is selected not for its broad coverage and consensus-based theory-neutrality but for the relevance of the distinctions which it makes to the specific questions which each analyst wishes to ask of his or her data. See McEnery and Wilson 1996: Ch. 2; Haan 1984.

prosodic annotation A type of annotation that aims to capture in a written form the suprasegmental features of spoken language, primarily stress, intonation and pauses. Prosodic annotation (or prosodic transcription) is a task which requires the manual involvement of highly skilled phoneticians: unlike part-of-speech analysis, it is not a task which can be delegated to the computer. See McEnery and Wilson 1996: Ch. 2; Nespor and Vogel 1990; Johansson et al. 1991; O’Connor and Arnold 1961.

recoverability A term used to refer to the possibility for the user to recover the basic original text from any text which has been annotated with further information. See McEnery and Wilson 1996: Ch. 2.
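
Recoverability can be illustrated with a small Python sketch. It assumes a hypothetical annotation format in which each word carries its tag after an underscore (as in run_V); the function name `strip_tags` and the tag labels are invented for the example, and real annotation schemes are usually richer.

```python
def strip_tags(tagged_text, sep="_"):
    """Recover the plain text from word_TAG style annotation."""
    words = []
    for item in tagged_text.split():
        # Split on the LAST separator so that words which themselves
        # contain the separator keep everything before the tag.
        word, _, _tag = item.rpartition(sep)
        # Items without a tag are passed through unchanged.
        words.append(word if word else item)
    return " ".join(words)


print(strip_tags("I_PRON run_V home_N"))  # I run home
```

If the tags can be removed this mechanically and the result matches the original wording, the annotation is recoverable in the sense defined above.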

semantic annotation A type of annotation that is used to mark semantic relationships between items in the text (e.g. agents or patients of particular actions) or semantic features of words in a text (the annotation of word senses in one form or another). See McEnery and Wilson 1996: Ch. 2; Jansen 1990; Schmidt 1991.

tag A term used to refer to (i) a code attached to words in a text representing some feature or set of features relating to those words; (ii) in the TEI, to refer to the physical markup of an element such as a paragraph. See McEnery and Wilson 1996: Ch. 2.

tagset A term used to refer to a collection of tags in the form of a scheme for annotating corpora. See McEnery and Wilson 1996: Ch. 2; Johansson et al. 1986; Garside et al. 1987.


Concordance

A term that signifies a list of the occurrences of a particular word or sequence of words, each shown in its context. The concordance is at the centre of corpus linguistics, because it gives access to many important language patterns in texts. Concordances of major works such as the Bible and Shakespeare have been available for many years. The computer has made concordances easy to compile.

Computer-generated concordances can be very flexible; the context of a word can be selected on various criteria (for example, by counting the words on either side or by finding the sentence boundaries). Also, sets of examples can be ordered in various ways. See Sinclair 1991: Ch. 2; McEnery and Wilson 1996: Ch. 1; Collier 1994; Kaye 1990; Hockey and Martin 1988.

co-text A more precise term than context or verbal context used to refer to the words on either side of a selected word or phrase. See Sinclair 1991: Ch. 9.

collocate A term used to refer to the words that occur to the left and to the right of the node. See Sinclair 1991: Ch. 8; Kennedy 1991; Kjellmer 1991; Kjellmer 1990; Renouf and Sinclair 1991; Jackson 1988.

collocation A term used to refer to a combination of words that have a certain mutual expectancy, i.e. words that regularly keep company with certain other words. When a collocation appears with a greater frequency than chance, it is called a significant collocation. The usual measure of proximity is a maximum of four words intervening. The identification of patterns of word co-occurrence in textual data is particularly important in dictionary writing, natural language processing and language teaching. See Sinclair 1991: Ch. 8; Kennedy 1991; Kjellmer 1991; Kjellmer 1990; Renouf and Sinclair 1991; Jackson 1988.

KWAL An abbreviation for key word and line; a form of concordance which can allow several lines of context either side of the key word. See McEnery and Wilson 1996.

KWIC An abbreviation for key word in context; a form of concordance in which a word is given within x words of context and is normally centered down the middle of the page. See Sinclair 1991: Ch. 2; Kaye 1989.
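
A KWIC display is straightforward to sketch in Python. The example below is only an illustration: it assumes the text is already tokenised into a list of words, measures context in words rather than characters, and uses the invented function names `kwic` and `format_kwic`.

```python
def kwic(tokens, keyword, width=4):
    """Return (left context, hit, right context) for every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Up to `width` words on either side of the hit.
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def format_kwic(lines):
    """Right-align the left context so the hits line up in a column."""
    pad = max((len(left) for left, _, _ in lines), default=0)
    return [f"{left:>{pad}}  {hit}  {right}" for left, hit, right in lines]


tokens = "the cat sat on the mat and the dog sat too".split()
for line in format_kwic(kwic(tokens, "sat", width=2)):
    print(line)
```

Padding the left context to a common width is what puts the key word down the middle of the display.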

node A term used to refer to the word or phrase in a collocation whose lexical behaviour is under examination. See Sinclair 1991: Ch. 8; Jackson 1988.

span A term used to refer to the measurement, in words, of the co-text of a word selected for study. A span of -4, +4 means that four words on either side of the node word will be taken to be its relevant verbal environment. See Sinclair 1991; Jackson 1988.
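
The relationship between node, span and collocates can be sketched as follows. This is a minimal illustration, not a test of significance: it assumes a pre-tokenised text and simply counts every word found within the span around each occurrence of the node; the function name `collocates` is invented here.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count the words occurring within `span` words of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # A span of -4, +4: four words to the left, four to the right.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


tokens = "apple tree and apple pie".split()
print(collocates(tokens, "apple", span=1))
```

Deciding whether a collocation is significant would additionally require comparing these counts with the frequency expected by chance, which this sketch does not attempt.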

Text Chunking

A term used to refer to the practice of dividing sentences into non-overlapping segments on the basis of fairly superficial analysis. Text chunking is a useful preliminary step to parsing. Chunking includes identifying the non-recursive portions of noun phrases; it can also be useful for other purposes, including index term generation. See Ramshaw and Marcus; Sinclair 1991: Ch. 9.


Disambiguation

A term used to refer to the practice of resolving ambiguity by choosing one specific analysis, or code (tag), from a variety of possibilities in corpus processing. Disambiguation may be applied at many levels, from deciding the part-of-speech of an ambiguous word (i.e. a word that may be associated with a number of different parts-of-speech) through to choosing one possible translation from many. Disambiguation may be probabilistic, i.e., carried out using statistically based methods, or rule-based, i.e., performed using rules created by drawing on a linguist’s intuitive knowledge. See McEnery and Wilson 1996: Ch. 5; Jansen 1990; Hindle 1989; DeRose 1988.


Encoding

A term used to refer to the practice of representing textual and linguistic data (i.e. annotations, or tags) in a certain format in a corpus. The demand for extensive reusability of large text collections requires standardisation of encoding formats. A standard encoding format must provide the greatest possible generality and flexibility, i.e., accommodate all potential types of information and processing. See Bryan 1988; McEnery and Wilson 1996: Ch. 2; Ide 1996.

CES An abbreviation for Corpus Encoding Standard used to refer to a set of encoding standards developed by MULTEXT (one of the largest EU projects in the domain of language tools and resources). The CES is an application of SGML, based on and in broad agreement with the TEI Guidelines and is optimally suited for use in corpus linguistics and language engineering applications. See Ide and Véronis 1995; Erjavec et al. 1995.

COCOA references A term taken from COCOA, a very early computer program used for extracting indexes of words in context from machine-readable texts. Its conventions were carried forward into several other programs (e.g. the Oxford Concordance Program (OCP)). COCOA references represent only an informal trend for encoding specific types of textual information, for example, authors, dates, and titles. See McEnery and Wilson 1996: Ch. 2; Hockey and Martin 1988.

DTD An abbreviation for Document Type Definition used in the TEI. TEI DTD is a formal representation which tells the user or a computer program what elements a text contains and how these elements are combined. A TEI DTD is composed of the core tagsets, a single base tagset, and any number of user selected additional tagsets, built up according to a set of rules documented in the TEI Guidelines. See Ide 1995; McEnery and Wilson 1996: Ch. 2; Sperberg-McQueen and Burnard 1994.

EAGLES An abbreviation for Expert Advisory Groups on Language Engineering Standards, an EU-sponsored project to define standards for the computational treatment (e.g. annotation) of EU languages; the name is also used to refer to a base set of features for the annotation of parts-of-speech. See McEnery and Wilson 1996: Ch. 2.

entity reference A term in the TEI used to refer to a shorthand way of encoding information in a text. See Sperberg-McQueen and Burnard 1994.

SGML An abbreviation for Standard Generalized Markup Language, a text encoding standard (the one employed by the TEI). SGML is an internationally recognized standard. SGML-aware software is widely used in corpus processing. See McEnery and Wilson 1996: Ch. 2; Erjavec 1995; Ide 1995; Goldfarb 1990; Bryan 1988.

TEI An abbreviation for Text Encoding Initiative, which signifies an international cooperative research project established (1988) to develop a general and flexible set of guidelines for the preparation and interchange of electronic texts. TEI employs an already existing form of document markup known as SGML. The TEI’s own original contribution is a detailed set of guidelines as to how this standard is to be used in text encoding. See Ide 1995; McEnery and Wilson 1996: Ch. 2; Sperberg-McQueen and Burnard 1994.

base tagset A term used in the TEI to refer to a particular group of codes (tags) which determines the basic structure of the document with which it is to be used. Eight distinct TEI base tagsets are proposed: prose, verse, drama, transcribed speech, letters and memos, dictionary entries, terminological entries, language corpora and collections. See Ide 1995; Sperberg-McQueen and Burnard 1994.

TEI Guidelines A term used to refer to standardized encoding conventions for encoding and interchange of machine-readable texts. TEI Guidelines (issued in May 1994) provide standardized encoding conventions for a large range of text types and features relevant for a broad range of applications, including NLP, information retrieval, hypertext, electronic publishing, various forms of literary and historical analysis, lexicography, etc. The Guidelines are intended to apply to texts, written or spoken, in any natural language, of any date, in any genre or text type, without restriction on form or content. SGML is the framework for development of the Guidelines. See Sperberg-McQueen and Burnard 1994; Ide 1995; McEnery and Wilson 1996: Ch. 2.

header A term used to refer to the part of an electronic document that precedes the text proper and contains information about the document such as author, title, source and so on. See Ide 1995; McEnery and Wilson 1996: Ch. 2; Sperberg-McQueen and Burnard 1994.

WSD An abbreviation for Writing System Declaration used in the TEI to define the character set used in encoding an electronic text. See Sperberg-McQueen and Burnard 1994.


Lemmatisation

A term used to refer to the practice of reducing the word forms in a corpus to their respective lexemes (the head word forms that one would look up in a dictionary). For example, the forms kicks, kicked, and kicking would all be reduced to the lexeme KICK. These variants are said to form the lemma of the lexeme KICK. Lemmatisation applies equally to morphologically irregular forms, so that went, as well as goes, going, and gone, belongs to the lemma of GO. Lemmatisation allows the researcher to extract and examine all the variants of a particular lexeme without having to input all the possible variants. (Software for lemmatisation is called a lemmatiser.) See McEnery and Wilson: Ch. 2; Beale 1987; Sinclair 1991: Ch. 3.
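
The reduction of word forms to lexemes can be sketched with a lookup table. This is a toy illustration only: the table `LEMMA_TABLE` and the function names are invented, and it covers just the KICK and GO examples from the entry above, whereas a real lemmatiser relies on a full machine lexicon or on morphological rules.

```python
# Hypothetical lookup table mapping word forms to their lexemes.
LEMMA_TABLE = {
    "kick": "KICK", "kicks": "KICK", "kicked": "KICK", "kicking": "KICK",
    "go": "GO", "goes": "GO", "going": "GO", "gone": "GO", "went": "GO",
}


def lemmatise(tokens, table=LEMMA_TABLE):
    """Replace each known word form by its lexeme; keep unknown forms."""
    return [table.get(tok.lower(), tok) for tok in tokens]


def find_lexeme(tokens, lexeme, table=LEMMA_TABLE):
    """Retrieve every variant of a lexeme without listing the variants."""
    return [tok for tok in tokens if table.get(tok.lower()) == lexeme]


print(lemmatise("She went and kicked it".split()))
print(find_lexeme("He goes where she went".split(), "GO"))
```

The second function shows the practical benefit: a single query for GO retrieves the irregular form went along with goes.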


Parsing

A term used to refer to the practice of assigning syntactic structure to a text. Parsing is usually performed after basic morphosyntactic categories have been identified in a text; it brings these categories into higher level syntactic relationships with one another. Parsing is probably the most commonly encountered form of corpus annotation after part-of-speech tagging. Corpora which have been parsed are sometimes known as treebanks. See McEnery and Wilson 1996: Ch. 2; Garside and McEnery 1993; Sampson 1992; Aarts and Heuvel 1985.

full parsing A type of parsing that aims to provide analysis of the sentence structure as detailed as possible. See McEnery and Wilson 1996: Ch. 2.

skeleton parsing A type of parsing that is a less detailed approach which tends to use a less finely distinguished set of syntactic constituent types and ignores, for example, the internal structure of certain constituent types. See Garside and McEnery 1993; Leech and Garside 1991.


Validation

A term used to refer to checking that products or their elements conform to certain acknowledged standards: the corpus has to be the size it claims, it must be composed and encoded the way it claims, all encoded features must be usable for retrieval, annotations must conform to a given standard, and the error rate for encoding and annotation must not exceed a certain level. Validation guarantees that clients get what they ordered and that they can rely on the resources to the extent stated by the validation certificate. Validation has to be carried out on an unbiased and neutral basis, which means not by the institution where the resources were created. See Teubert 1995.

Language/Linguistic Resources

A general term used to refer to such resources as corpora of spoken and written language, frequency lists, lexicons, computational linguistic lexicons and tools to extract linguistic knowledge to develop and optimize products. Linguistic resources are divided into corpora, lexical resources and tools. However, the borderline is not very distinct. See Gellerstam 1995; McEnery and Wilson 1996; Aarts and Meijs 1990; Edwards 1994.


Corpus

A central term in corpus linguistics used to refer to (i) (loosely) any body of text; (ii) (most commonly) a body of machine-readable text; (iii) (more strictly) a finite collection of machine-readable texts, sampled to be maximally representative of a language variety. See McEnery and Wilson 1996: Ch. 2; Sinclair 1982 and 1991; Johansson 1991; Collins 1988; Meyer 1986; Aarts and Meijs 1990; Biber and Finegan 1991; Edwards 1994.

annotated corpus A type of corpus enhanced with various types of linguistic information (or tagged corpus). An annotated corpus may be considered to be a repository of linguistic information, because the information which was implicit in the plain text has been made explicit through concrete annotation. See McEnery and Wilson 1996: Ch. 1; Aarts and Meijs 1990.

balanced corpus A type of corpus composed according to parameters such as text type, genre or domain. See Teubert 1995.

comparable (reference) corpus A type of corpus used for comparison of different languages. A comparable corpus consists of a number of corpora in each language, all following the same composition pattern. The Commission of the European Community is funding a project whose main goal is the creation of comparable reference corpora (of 50 million words each) for all the official languages of the European Union. Comparable corpora are an indispensable source for bilingual and multilingual lexicons and a new generation of dictionaries. See LE-PAROLE 1995: Ann. 1.

monitor corpus A type of corpus which is a growing, non-finite collection of texts, of primary use in lexicography. A monitor corpus reflects language change by growing at a constant rate while leaving untouched the relative weight of its components (i.e. its balance) as defined by the parameters. The same composition schema should be followed year by year, the basis being a reference corpus with texts spoken or written in one single year. See Sinclair and Ball 1995; Sinclair 1991: Ch. 1; Clear 1987.

monolingual corpus A type of corpus which contains texts in a single language. See McEnery and Wilson 1996: Ch. 2.

multilingual corpus A type of corpus which consists of small collections of individual monolingual corpora (or subcorpora) that use the same or similar sampling procedures and categories for each language but contain completely different texts in those several languages (for two languages, a bilingual corpus). See McEnery and Wilson 1996: Ch. 2; McEnery and Oakes 1994.

opportunistic corpus A type of corpus which is an inexpensive collection of electronic texts that can be obtained, converted, and used free or at a very modest price, but is often unfinished and incomplete: the users are left to fill in blank spots for themselves. Such corpora have their place in environments where size and corpus access do not pose a problem. The opportunistic corpus is a virtual corpus in the sense that the selection of an actual corpus (from the opportunistic corpus) depends on the needs of a particular project. Today’s monitor corpora usually are opportunistic corpora. See Sinclair and Ball 1995.

parallel (aligned) corpus A type of multilingual corpus where texts in one language and their translations into other languages are aligned, sentence by sentence, preferably phrase by phrase. Sometimes reciprocal parallel corpora are set up, i.e. corpora containing authentic texts as well as translations in each of the languages involved. This allows double-checking of translation equivalents.

Note: Some corpus linguists employ a different terminology for multilingual corpora: they refer to parallel corpora (as defined here) as ‘translation corpora’ and use the term ‘parallel corpora’ instead to refer to the other kind of multilingual corpus, which does not contain the same texts in different languages. See Sinclair and Ball 1995; McEnery and Wilson 1996: Ch. 2; McEnery and Oakes 1994 and 1996; Zanettin 1994; Erjavec et al. 1995.

reference corpus A type of corpus that is composed on the basis of relevant parameters agreed upon by the linguistic community and should include spoken and written, formal and informal language representing various social and situational strata. They are used as benchmarks for lexicons and for the performance of generic tools and specific language technology applications. They are large in size; 50 million words is considered to be the absolute minimum; 100 million will become the European standard in a few years. See Sinclair and Ball 1995.

sampled corpus A type of corpus which contains a finite collection of texts, often chosen with great care and studied in detail. Once a sampled corpus is established, it is not added to or changed in any way. See Sinclair 1991: Ch. 1.

saturated corpus A type of corpus in which the growth rate of the vocabulary stops decreasing and becomes constant (i.e. saturated). Thus, saturation is the point from which there will be perhaps eight new words for each 10000 additional words of text. Saturation of corpora is a fairly new concept, and no one knows what it leads to in terms of corpus size. See Teubert 1995.
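
Vocabulary growth can be tracked by counting how many previously unseen word forms each successive block of text contributes. The sketch below is only an illustration: it assumes a pre-tokenised corpus and counts new word forms (types) rather than new lexemes, and the function name `new_types_per_block` is invented.

```python
def new_types_per_block(tokens, block_size=10000):
    """Return the number of previously unseen types in each block."""
    seen = set()
    growth = []
    for start in range(0, len(tokens), block_size):
        block = tokens[start:start + block_size]
        # Types in this block that no earlier block contained.
        new = set(block) - seen
        growth.append(len(new))
        seen |= new
    return growth


# Tiny demonstration with blocks of two words instead of 10000.
print(new_types_per_block("a b a c a b d a".split(), block_size=2))  # [2, 1, 0, 1]
```

On a real corpus, saturation in the sense above would show up as the values in the returned list levelling off at a roughly constant figure.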

special corpus A type of corpus assembled for a specific purpose; special corpora vary in size and composition according to that purpose. Special corpora are not balanced (except within the scope of their given purpose) and, if used for other purposes, give a distorted view of the language segment. Their main advantage is that the texts can be selected in such a way that the phenomena one is looking for occur much more frequently than in a balanced corpus. A corpus that is enriched in such a way can be much smaller than a balanced corpus providing the same data. See Sinclair and Ball 1995.

spoken corpus A type of corpus that contains texts of spoken language. Spoken corpora are annotated using a form of phonetic transcription. Not many examples of publicly available fully phonetically transcribed corpora exist at the present time. Phonetically transcribed corpora are a useful addition to the battery of annotated corpora, especially for the linguist who lacks the technological tools and expertise for the laboratory analysis of recorded speech. See McEnery and Wilson 1996: Ch. 2; Crowdy 1993; Greenbaum 1990.

treebank A type of corpus which has been annotated with phrase structure information (also called a parsed corpus). The term alludes to the representation of syntactic relationships (see parsing) by tree diagrams or phrase markers. See McEnery and Wilson 1996: Ch. 2; Garside and McEnery 1993; Souter and Atwell 1994.

unannotated corpus A type of corpus that is in the raw state of plain text, as opposed to an annotated corpus. Unannotated corpora (or raw corpora) have been, and are, of considerable use in language study, but the utility of a corpus is considerably increased by the provision of annotation. See McEnery and Wilson 1996: Ch. 2.

Lexical Resources/Data

A general term used to refer to lexical data, preferably in machine-readable form, that can be used in lexical research and/or form the basis of commercial products. See Gellerstam 1995; Calzolari 1989.

computational linguistic lexicon A more complex type of lexicon for parsing, for artificial intelligence (question-answering) and for machine translation. See Gellerstam 1995.

frequency list A term used to refer to a list of the words, or other textual elements, in a text together with the frequencies of their appearance. At present, compiling frequency lists is one of the most trivial functions that lingware deals with. See Sinclair 1991: Ch. 2; Johansson and Hofland 1989; Woods et al. 1986; McEnery and Wilson 1996.
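
A word frequency list takes only a few lines of Python. The sketch assumes a pre-tokenised text and folds case before counting; the function name `frequency_list` is invented for illustration.

```python
from collections import Counter


def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    counts = Counter(tok.lower() for tok in tokens)
    # Sort by descending frequency, then alphabetically for ties.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


for word, count in frequency_list("I see a cat and a dog".split()):
    print(word, count)
```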

lexical data base (LDB) A term used to refer to data bases which contain formalized lexical information at many descriptive levels. The LDB is one of the chief tools today for processing great quantities of lexical data. It can be used for various types of linguistic applications and for general research in the lexical field. A data base management system provides users with tools which enable them to access the data without necessarily being familiar with its internal or physical organisation, but only with the type of information they can retrieve. See Gellerstam 1995; Halteren and Heuvel 1990; Haan 1987; Kaye 1988; Calzolari 1989.

lexicon A term essentially synonymous with ‘dictionary’ - a collection of words and information about them, but this term is used more commonly than dictionary to refer to machine-readable dictionary data bases (or electronic dictionary). See Beale 1987; McEnery and Wilson 1996: Ch. 5; Garside and McEnery 1993; Garside 1987; Zernik 1991; Sinclair 1996; Calzolari 1989.

machine lexicon A type of lexicon which is not designed to be read by humans but provides explicit lexical information for performing specific tasks, e.g., automatic lemmatisation. See Gellerstam 1995.


A general concept which includes any tools or applications that are worth putting money into. See Engelien and McBryde 1991.

automatic hyphenizer A tool that automatically hyphenates a text according to grammatical conventions. See Gellerstam 1995.

computer-aided learning / computer-assisted language learning (CALL) A term used to refer to computer applications and software based on lexical data that can be used in various types of interactive teaching of written or spoken language skills such as sentence restructuring, checking of translation, dictation tasks, dictionary look-up, etc. One method of language learning is the data-driven learning approach, which attempts to give direct access to the data and cut out the middleman. This approach is based on the assumption that effective language learning is a form of research performed by the learners themselves. See Johns 1991; McEnery and Wilson 1993; McEnery et al. 1995; Wilson and McEnery 1994.

computer-aided translation (CAT) (or translator’s workbench) A term used to refer to computer systems, programs or applications which contain tools and facilities which help translators to increase their productivity and the quality of their work. These include monolingual or bilingual lexicons, translation memories (which help to avoid translating the same or similar fragments more than once), spelling checkers, terminology databases, translation editors, terminology extraction, access to previously translated texts, document comparison, thesauruses, etc. See Krauwer 1995; McEnery and Wilson 1996: Ch. 5.

general text checker A tool that checks practical things like starting a new sentence with a capital letter, spotting extra spaces between words, etc. See Gellerstam 1995.

spelling checker A tool that is usually based on a collection of word forms representing an actual corpus, or on a list of word forms generated from a dictionary, and is used to find spelling errors in a text. The spelling checker is probably the number one commercial application, and its facilities are more or less standard ingredients in word processing today. See Teubert 1995; Gellerstam 1995.
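
The word-list approach to spelling checking can be sketched as a simple membership test. This is a minimal illustration under stated assumptions: the word list is a tiny invented sample, the function name `misspelled` is hypothetical, and no correction suggestions are produced.

```python
def misspelled(tokens, word_forms):
    """Return the tokens that are not in the list of known word forms."""
    # Fold case so that sentence-initial capitals are not flagged.
    known = {w.lower() for w in word_forms}
    return [tok for tok in tokens if tok.lower() not in known]


# Hypothetical word list; a real one would come from a corpus or dictionary.
word_forms = ["the", "cat", "sat", "on", "mat"]
print(misspelled("The cat satt on the mat".split(), word_forms))  # ['satt']
```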

style checker A tool that checks particular words from a stylistic point of view (“why do you use the passive form?”), parses the text to spot grammatical errors (such as errors of congruence), and checks contextual data (“have you used the right preposition after the verb?”). See Gellerstam 1995.



Haiyang Ai


