The word "corpus", derived from the Latin word meaning "body", may be used to refer to any text in written or spoken form. However, in modern Linguistics this term is used to refer to large collections of texts which represent a sample of a particular variety or use of language(s) that are presented in machine readable form. Other definitions, broader or stricter, exist. See, for example, the definition in the book "Corpus Linguistics" by Tony McEnery and Andrew Wilson or read more about different kinds of corpora in the Systematic Dictionary of Corpus Linguistics.
Computer-readable corpora can consist of raw text only, i.e. plain text with no additional information. Many corpora have been provided with some kind of linguistic information, here called mark-up or annotation.
Computer-readable corpora can consist of raw text only, i.e. plain text with no additional information. Many corpora have been provided with some kind of linguistic information, here called mark-up or annotation.