Professional English Research Consortium (PERC)



Project Description

- Introduction
- Running Time of the Project
- Design Criteria
- Text Encoding
- Linguistic Annotation of Texts
- Storage and Documentation of Texts
- Corpus Retrieving Software
- List of Text Distributors

The Corpus of Professional English (CPE) is a major PERC research project, currently underway. When finished, it will consist of a 100-million-word computerized database of English used by professionals in science, engineering, technology and other fields. The CPE will be used for research as well as for the development of educational resources, such as specialized dictionaries, handbooks, language tests, and other materials that will be useful to working professionals and professionals-in-training.

When complete, portions of the corpus will be made available to researchers, enabling them to retrieve various kinds of linguistic information via the Internet with our original search software. The software is programmed so that users cannot extract complete sample texts, since doing so might infringe copyright. General researchers will be charged a minimal fee for access to the CPE, to cover the running costs of the online search system.

The publishers in the consortium clearly recognize the dangers inherent in electro-copying and are as concerned as you are that the CPE should not allow the abuse of copyrighted texts. A text sample, by being included in the CPE, loses none of the protection afforded by copyright law.

The End User license strictly controls the use of the CPE and the text samples it contains. Reproduction of individual original text samples by any means is explicitly forbidden. None of the text samples in their original form will be incorporated into any product. Quotations from text samples will be strictly limited by the fair dealing provisions of copyright law.

The CPE will typically be used for professional English research and for the development of educational materials, such as specialized dictionaries and other educational resources that require accurate information about word meanings and usage, collocations, and other relevant linguistic data.

Running Time of the Project
December 2001 - December 2003
(1st phase for science and technology texts)

Design Criteria (1st phase)
A) Monolingual
B) Professional writing (academic standard of texts)
C) Synchronic (1995-2001)
D) Regional variety (AmE/ BrE/ etc.)
E) Sample (50,000 words per text/ full text/ etc.)
F) Selection criteria
domains: science and technology including life science
(based upon "Journal Citation Report")
media: academic journals, trade magazines, textbooks, web pages

Text Encoding
The following information will be indicated by the mark-up:
1. Word boundaries and parts of speech
2. Sentence structure identified by a POS tagger
3. Paragraphs, sections, headings and similar features in written texts
4. Meta-textual information about the source or encoding of individual texts
The XML format will be adopted for text encoding.
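
As an illustration only, the kind of mark-up listed above could be produced along the following lines. This is a minimal sketch and not the CPE encoding scheme: the element names (section, head, p, s, w), the pos attribute and the tag values are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def encode_section(heading, paragraphs):
    """Encode one section of a text as XML (hypothetical element names).

    heading:    the section heading as a string
    paragraphs: list of paragraphs; each paragraph is a list of sentences;
                each sentence is a list of (word, pos_tag) pairs
    """
    section = ET.Element("section")
    ET.SubElement(section, "head").text = heading
    for paragraph in paragraphs:
        p = ET.SubElement(section, "p")
        for sentence in paragraph:
            s = ET.SubElement(p, "s")
            for word, pos in sentence:
                # Each word is delimited by its own element and carries
                # its part-of-speech tag as an attribute.
                ET.SubElement(s, "w", pos=pos).text = word
    return section


# Illustrative input; the tag values are placeholders, not the CPE tagset.
sample = encode_section("Results", [[[("Cells", "NN2"), ("divide", "VV0")]]])
xml_string = ET.tostring(sample, encoding="unicode")
print(xml_string)
```

Marking each word as a separate element makes word boundaries explicit, and the nesting of w inside s inside p records sentence and paragraph structure.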

Linguistic Annotation of Texts
The grammatical tagging of the text will be done in collaboration with Lancaster University (UCREL).

Storage and Documentation of Texts
Detailed descriptive information will be added to each text, in the form of a header: the author's name, title, publication year, journal title, etc.
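
A header of this kind might be represented as follows. This is a sketch under stated assumptions: the class name TextHeader, the field names and the sample values are hypothetical, and only the four fields named above are modelled.

```python
from dataclasses import dataclass, asdict
import xml.etree.ElementTree as ET


@dataclass
class TextHeader:
    """Descriptive information attached to one text (hypothetical fields)."""
    author: str
    title: str
    publication_year: int
    journal_title: str

    def to_xml(self):
        """Return the header as an XML element, one child per field."""
        header = ET.Element("header")
        for name, value in asdict(self).items():
            ET.SubElement(header, name).text = str(value)
        return header


# Placeholder values for illustration only.
header = TextHeader(
    author="A. Author",
    title="An example article",
    publication_year=2000,
    journal_title="Example Journal",
)
header_xml = ET.tostring(header.to_xml(), encoding="unicode")
print(header_xml)
```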

Corpus Retrieving Software
Web-based multi-functional search software developed by the Shogakukan Multimedia Department will be used. The software is also to be adopted for the online BNC search service and the online CobuildDirect search service administered by Shogakukan for Japanese users with authorization from the BNC and HarperCollins.
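
One common way for corpus search software to return linguistic information without exposing complete texts is a keyword-in-context (KWIC) concordance with a fixed context window. The sketch below illustrates that general technique only; it is not the Shogakukan software, and the function name kwic and its window and max_hits parameters are assumptions made for the example.

```python
def kwic(tokens, keyword, window=5, max_hits=20):
    """Return keyword-in-context lines for a tokenized text.

    Each hit is a (left_context, match, right_context) tuple limited to
    `window` tokens on each side, so no query returns a complete text.
    """
    hits = []
    target = keyword.lower()
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
            if len(hits) >= max_hits:
                break  # cap the number of lines returned per query
    return hits


tokens = "The enzyme binds the substrate and the enzyme releases the product".split()
lines = kwic(tokens, "enzyme", window=2)
for left, match, right in lines:
    print(f"{left:>10} [{match}] {right}")
```

Limiting both the context window and the number of hits per query keeps the amount of text returned by a single search small.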