The SUSANNE Analytic Scheme
The Need for Language Taxonomy
To enable computers to process human language, we need databases (corpora) of language samples annotated to show their structural features, as a source of information and statistics to guide the development of language-processing algorithms. This in turn requires some set of categories to be explicitly defined, so that researchers exchanging language data can be confident that they are using the annotations in the same way. Computational linguistics needs something like the Linnaean taxonomy created for botany in the 18th century, which for the first time enabled naturalists everywhere to exchange information about plants, secure in the knowledge that when they used the same names they were talking about the same things.

(To get a sense of the massive variety of annotation practices which have emerged from the lack, in the past, of any explicit public taxonomy that researchers could choose to standardize on, see the catalogue compiled by the Linguistic Data Consortium.)

Beginning in 1983 I led an effort, which came to fruition with my 1995 book English for the Computer (Oxford University Press), to produce this sort of Linnaean taxonomy for English: the SUSANNE scheme. While it will certainly not be the last word on the subject, the SUSANNE scheme is, so far as I am aware, the first serious attempt anywhere to produce a comprehensive, fully explicit annotation scheme for English grammatical structure. It has won praise internationally, e.g.: