Joint E-C Bilingual Corpus Linguistics glossary


Please post any corpus linguistics terms that you come across in your reading and study. Apart from terms that already have Chinese equivalents widely accepted by corpus linguists and users, you may give new terms your own translation according to your understanding. Let's discuss them and piece together a reference list of corpus linguistics terminology. What do you think?

[This post was edited by xujiajin on 2005-09-17 at 15:56:09]
Great idea! We can then compile a glossary in English and Chinese.
It seems that the suggestion is not very popular! I would like to post the first term for translation, for you to inspect and discuss, my fellows!
Term: Reader/Writer Visibility
Source: "Writer/Reader Visibility in EFL Written Discourse" by Stephanie Petch-Tyson, in Learner English on Computer, edited by Sylviane Granger, 1998, Studies in Language and Linguistics series, Longman
Related Literature: 中国大学生英语书面语中的口语化倾向 (The tendency toward spoken style in Chinese university students' written English), 文秋芳 (Wen Qiufang) et al., 外语教学与研究 (Foreign Language Teaching and Research), 2003
Significance: It marks where a language user's style falls on the continuum from spoken style to written style.
You can find 中国大学生英语书面语中的口语化倾向 (文秋芳 et al., 外语教学与研究, 2003) in the 文章荟萃 (Article Collection) column of the site. I am sorry, I don't know how to attach an uploaded article to a reply! Enjoy it!
tag: 标记/码/标注码 (附码 - xiaoz)
tagging: 附码/赋码 加码(陶红印)
tagset: 附码集/赋码集
annotation: 标注
markup: 标注 (标记 - xiaoz)

[This post was edited by xiaoz on 2005-08-18 at 20:36:58]
parallel corpus: 平行语料库
learner corpus: 学习者语料库
monitor corpus: 监控语料库
The glossary I have prepared for my new book (some corpus names and organisation names can be omitted):

Could anyone help to give Chinese equivalents?

AAVE: African American Vernacular English
ACE: the Australian Corpus of English, also known as the Macquarie Corpus
ACH: the Association for Computers and the Humanities
ACL: the Association for Computational Linguistics
alignment: 对齐。establishing a link between the source text and the translation, usually at the sentence, phrase or word level.
ALLC: the Association for Literary and Linguistic Computing
ANC: 美国国家语料库 the American National Corpus
annotation: 标注/赋码/附码。the process of encoding interpretative linguistic information in a corpus
ARCHER: a Representative Corpus of Historical English Registers
ASCII: American Standard Code for Information Interchange
authenticity: 真实性 a feature that characterizes naturally occurring corpus data
BNC: 英国国家语料库 the British National Corpus
BNCweb: the web interface of the BNC, developed at Zurich University
BOE: the Bank of English
Brown: the Brown University Standard Corpus of Present-day American English
CA: 对比分析 contrastive analysis
CANCODE: the Cambridge and Nottingham Corpus of Discourse in English
CDA: 批判话语分析/批评话语分析 critical discourse analysis
CED: the Corpus of English Dialects
CEPC: Chinese-English Parallel Corpus
CES: the Corpus Encoding Standard
character encoding: 字符编码 a system of using numeric values to represent characters
CHILDES: the Child Language Data Exchange System
chi-square test: a measure of statistical significance
CIA: 中介语对比分析 Contrastive Interlanguage Analysis
CKIP: the Chinese Knowledge Information Processing group at Academia Sinica, Taipei
CLC: Cambridge Learner Corpus
CLEC: Chinese Learner English Corpus
COCOA: one of the earliest markup schemes that uses a set of attribute names and values enclosed in angled brackets
colligation: 类连接 the collocation of a node word with a particular grammatical class of words
collocation: 搭配 the characteristic co-occurrence of patterns of words
comparable corpus: 对应语料库。a corpus which is composed of L1 data collected from different languages using the same sampling techniques
comparative corpus: 比较语料库。a corpus containing components of varieties of the same language
concordance: 检索。an alphabetical index of a search pattern in a corpus, showing every contextual occurrence of the search pattern
concordancer: 索引工具 a software package that extracts concordances from a corpus
corpora: 语料库(复数)the widely accepted plural form of corpus
corpus balance: 语料库的均衡性 the range of different types of language that a corpus claims to cover
corpus header: 语料库头文件 the part of a corpus that provides necessary bibliographical information, taxonomies used and other metadata relating to a corpus
corpus: 语料库 a collection of sampled texts, written or spoken, in machine readable form which may be annotated with various forms of linguistic information
corpuses: a less commonly used plural form of corpus
CPE: the Corpus of Professional English
CPSA: the Corpus of Professional Spoken American English
cross-tabulation: a table showing the frequencies for each variable across each sample
CSAE: the Corpus of South African English
DCMI: the Dublin Core Metadata Initiative
DDL: data-driven learning
dispersion: a term in descriptive statistics which refers to a quantifiable variation of measurements of differing members of a population within the scale on which they are measured
ditto tag: in corpus annotation assigning the same part-of-speech code to each word in an idiomatic expression
DTD: Document Type Definition, used in markup languages such as HTML, SGML and XML
EAGLES: Expert Advisory Group on Language Engineering Standards
EAP: English for Academic Purposes
EBMT: Example-based Machine Translation
EMILLE: the Enabling Minority Language Engineering (project and corpora)
ENPC: the English-Norwegian Parallel Corpus
error-tagging: 错误赋码/错误标注 assigning codes indicating the types of errors occurring in a learner corpus
factor analysis: 因子分析 a statistical analysis commonly used in the social and behavioural sciences to summarize the interrelationships among a large group of variables in a concise fashion
Fisher's exact test: an alternative to the chi-square or log-likelihood test that measures the exact statistical significance level
FLOB: the Freiburg-LOB Corpus of British English, an update of the LOB corpus in the early 1990s
frequency: 频数/频率/频次。also called raw frequency, the actual count of a linguistic feature in a corpus
Frown: the Freiburg-Brown Corpus of American English, an update of the Brown corpus in the early 1990s
HKUST: the HKUST Computer Science Corpus
HTML: 超文本标记语言 Hypertext Markup Language
ICE: the International Corpus of English
ICLE: the International Corpus of Learner English
IMDI: the ISLE Metadata Initiative
interlanguage: 中介语/过渡语 the learner’s knowledge of the L2 which is independent of both the L1 and the actual L2
JEFLL: the Japanese EFL Learner Corpus
keyword: 关键词。words in a corpus whose frequency is unusually high (positive keywords) or low (negative keywords) in comparison with a reference corpus
KWIC: key-word-in-context concordance
LCA: the Lancaster Corpus of Abuse
LCMC: the Lancaster Corpus of Mandarin Chinese
lemmatization: 词形归并/还原 (another translation is 削尾处理, which I think is inaccurate) grouping together all of the different inflected forms of the same word
lexicon: an inventory of word forms in a given language
LGSWE: the Longman Grammar of Spoken and Written English
LIVAC: Linguistic Variations in Chinese Speech Communities, a synchronous corpus of Mandarin Chinese
LLC: the London-Lund Corpus; also found to refer to the Longman Learner Corpus in the literature
LOB: the Lancaster-Oslo-Bergen Corpus of British English
LOCNESS: the Louvain Corpus of Native English Essays
log-likelihood test: also known as an LL test, an alternative to the chi-square test
LPC: the Lancaster Parsed Corpus
LSAC: the Longman Spoken American Corpus
LSP: language for specific purposes
markup: 标记。a system of standard codes inserted into a document stored in electronic form to provide information about the text itself and govern formatting, printing or other processing
MATE: the Multi-Level Annotation Tools Engineering project
mean: 平均数。the arithmetic average, which can be calculated by adding all of the scores together and then dividing the sum by the number of scores
merger: combination of two or more words (e.g. can’t and gonna)
metadata: a term used to describe data about data, typically the contextual information of corpus samples
MI: mutual information, a statistical formula borrowed from information theory
MICASE: the Michigan Corpus of Academic Spoken English
Microconcord: a concordance package published by Oxford University Press
ML: machine learning
MLCT: the Multilingual Corpus Tool package developed by Scott Songlin Piao
monitor corpus: 监控语料库。a corpus that is constantly supplemented with fresh material and keeps increasing in size
MonoConc: a concordancer package published by Athelstan
MUC: the Message Understanding Conference
Multiconcord: a multilingual parallel concordancer developed at the University of Birmingham
MWU: multiword unit
NLP: natural language processing
normalization: a process which makes frequencies from samples of markedly different sizes comparable by bringing them to a common base
OCR: optical character recognition
OLAC: the Open Language Archives Community
ParaConc: a bilingual or multilingual concordancer published by Athelstan
parallel corpus: 平行语料库。a corpus which is composed of source texts and their translations in one or more different languages; sometimes referred to as translation corpus
parsing: 句法分析。also called treebanking or bracketing, a process that analyzes the sentences in a corpus into their constituents
PERC: the Professional English Research Consortium
PNC: the Polish National Corpus
population: 总体 the entire set of items from which samples can be drawn
POS: part-of-speech
post-editing: human correction of automatically processed data
range: the difference between the highest and lowest frequencies
reference corpus: 参考语料库。a representative corpus balanced for general usage; in keyword analysis, a corpus that is used to provide a reference wordlist
representativeness: 代表性。a corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety
RP: Received Pronunciation, the notional standard form of spoken British English
sample: 样本 elements that are selected intentionally as a representation of the population being studied
sample corpus: 样本语料库。as opposed to a monitor corpus, a sample corpus is of finite size and consists of text segments selected to provide a static snapshot of language
SARA: SGML Aware Retrieval Application for the BNC
SBCSAE: the Santa Barbara Corpus of Spoken American English
SEC: the Lancaster/IBM Spoken English Corpus
SED: the Survey of English Dialects corpus
semantic prosody: 语义韵。the collocational meaning arising from the interaction between a given node word and its collocates
SEU: Survey of English Usage
SGML: the Standard Generalized Markup Language
skeleton parsing: also called shallow parsing, a parsing technique that uses less fine-grained constituent types than would be present in a full parse
SLA: second language acquisition
sort: arrange concordances or a wordlist in a certain order
SPAAC: the Speech Act Annotated Corpus developed at UCREL, Lancaster
specialized corpus: 专用语料库。a corpus that is domain or genre specific and is designed to represent a sublanguage
SPSS: Statistical Package for the Social Sciences
SST: the Standard Speaking Test corpus consisting of spoken data produced by Japanese learners of English
standardized type-token ratio: similar to type-token ratio, but computed every n (e.g. 1,000) words as the WordSmith Wordlist goes through each text file
subcorpus: 子库。a component of a corpus, usually defined using certain criteria such as text types and domains
tagging: 附码。an alternative term for annotation, especially word-level annotation such as POS tagging and semantic tagging
tagset: 附码集。a scheme of codes for corpus annotation, especially POS tagging
TEI: the Text Encoding Initiative
token: an occurrence of any given word form
tokenization: also called segmentation, a process that divides running text into legitimate word tokens, especially important for languages such as Chinese that do not delimit words with white spaces
transcription: 转写。converting spoken data into a written form
translationese: 翻译腔。a version of L1 language that has been influenced by the translation process
treebank: 树库。an alternative term for a parsed corpus
t-test: an alternative statistical test to the chi-square test
type: a word form
type-token ratio: the ratio between types and tokens, useful when comparing samples of roughly equal length
UCL: University College London
UCLES: the University of Cambridge Local Examinations Syndicate
UCREL: the University Centre for Computer Corpus Research on Language, Lancaster
Unicode: a character encoding system designed to support the interchange, processing, and display of all of the written texts of the diverse languages of the world
URL: Uniform Resource Locator, i.e. an Internet address
USAS: the UCREL Semantic Analysis System
UTF: Unicode Transformation Format
wildcard: 通配符。a special character such as an asterisk (*) or a question mark (?) that can be used to represent one or more characters in pattern matching
wordlist: a list of words occurring in a corpus, possibly with frequency information
WordSmith: a corpus exploration package with sophisticated statistical analysis, published by Oxford University Press
WSC: the Wellington Corpus of Spoken New Zealand English
WWC: the Wellington Corpus of Written New Zealand English
Xaira: XML Aware Indexing and Retrieval Architecture, a new XML-aware version of SARA that can work with different corpora
Xanadu: an X-windows interactive editor for anaphoric annotation, developed at Lancaster UCREL
XCES: XML Corpus Encoding Standard
XML: the Extensible Markup Language
z-test: an alternative statistical test to the chi-square test
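
The entries for normalization, type-token ratio and standardized type-token ratio in the list above describe simple calculations, so here is a minimal Python sketch of them. The function names, the per-million base and the choice to drop a final incomplete chunk are my own assumptions for illustration; this is not how WordSmith or any other package is known to implement them, and the token lists are assumed to be non-empty.

```python
def normalized_frequency(raw_count, corpus_size, base=1_000_000):
    """Bring a raw frequency to a common base (per million tokens by default)."""
    return raw_count * base / corpus_size


def type_token_ratio(tokens):
    """Ratio of distinct word forms (types) to running words (tokens)."""
    return len(set(tokens)) / len(tokens)


def standardized_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full chunks of chunk_size tokens.

    Assumption: a trailing chunk shorter than chunk_size is ignored.
    """
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:  # text shorter than one chunk: fall back to plain TTR
        return type_token_ratio(tokens)
    return sum(type_token_ratio(chunk) for chunk in chunks) / len(chunks)


# 50 hits in a 200,000-token corpus is 250 per million tokens
print(normalized_frequency(50, 200_000))           # 250.0

# Plain TTR drops as a text gets longer; the standardized version does not
tokens = ["a", "b", "a", "c"]
print(type_token_ratio(tokens))                    # 0.75
print(standardized_ttr(tokens, chunk_size=2))      # 1.0
```

The last two lines show why the standardized ratio is preferred for texts of unequal length: it is averaged over chunks of a fixed size, so it does not fall simply because a text is longer.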
[This post was edited by xujiajin on 2005-08-18 at 22:18:58]
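
As an illustration of the log-likelihood test and keyword entries in the glossary above, here is a small Python sketch of the two-corpus log-likelihood calculation often used to compare the frequency of one word in a study corpus and a reference corpus. The function name and the example figures are made up for illustration, and I am assuming the common formulation in which the expected frequencies are derived from the combined corpus.

```python
import math


def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood (LL) score for one word in two corpora.

    freq1, freq2: observed frequencies of the word in each corpus
    size1, size2: total number of tokens in each corpus
    """
    total = freq1 + freq2
    # Expected frequencies if the word were equally common in both corpora
    expected1 = size1 * total / (size1 + size2)
    expected2 = size2 * total / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    return 2 * ll


# Hypothetical counts: 30 hits vs 10 hits in two corpora of 1,000 tokens each
score = log_likelihood(30, 1000, 10, 1000)
print(round(score, 2))  # 10.46
```

A score above 3.84 corresponds to p < 0.05 with one degree of freedom, so a word with the counts in this example would qualify as a positive keyword of the first corpus at that level.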
It is a very long list! Mr. Xiao has made a significant contribution to it! It would be better for us to compile a list with Chinese equivalents and a brief introduction to each of the newest terms in corpus linguistics. On the one hand this would help readers in the field, and on the other it would help translators decide which words to use in the target language. Thanks a lot, Mr. Xiao!
I have an idea of compiling a wordlist/glossary of corpus linguistics.
We can collect quite a number of corpus linguistics papers and work out a wordlist with a concordancer we have.
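
To make the idea above concrete, here is a rough Python sketch of building a frequency wordlist from a collection of papers. It assumes the papers are already available as plain English text strings; the function name and the simple tokenization pattern are my own choices, and a real concordancer will handle tokenization more carefully.

```python
import re
from collections import Counter


def build_wordlist(texts):
    """Return (word, frequency) pairs, most frequent first."""
    counts = Counter()
    for text in texts:
        # Lower-case words, keeping internal hyphens and apostrophes
        counts.update(re.findall(r"[a-z]+(?:[-'][a-z]+)*", text.lower()))
    return counts.most_common()


# Hypothetical sample sentences standing in for the collected papers
papers = [
    "A parallel corpus aligns texts.",
    "A learner corpus and a monitor corpus.",
]
for word, freq in build_wordlist(papers)[:3]:
    print(word, freq)
```

Candidate glossary terms could then be picked by hand from the top of the list, or by comparing it with a wordlist from a general reference corpus. Multiword terms such as "parallel corpus" would need word sequences rather than single words.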
Re: E-C Bilingual Corpus Linguistics glossary

Quoting xujiajin (2005-8-24 20:01:24):
I have an idea of compiling a wordlist/glossary of corpus linguistics.
We can collect quite a number of corpus linguistics papers and work out a wordlist with a concordancer we have.

Good idea. This can be an ongoing process, and the list will grow longer as we move on.
Re: E-C Bilingual Corpus Linguistics glossary

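Quoting xujiajin (2005-8-24 13:14:04):
cluster has two Chinese translations: 词丛 and 词簇
Today frankliang told me that some people translate ngram (N-gram) as n元 (N元). Sure enough, when I got home this evening I found that 黄昌宁 (Huang Changning) and 李涓子 (Li Juanzi) use exactly N元 in their book 《语料库语言学》 (Corpus Linguistics).
Below are some glossary entries I excerpted from 《计算语言学概论》 (An Introduction to Computational Linguistics), published in 2004 by the Commercial Press (商务出版社), with Mr. 俞士汶 (Yu Shiwen) as chief editor and 常宝宝 (Chang Baobao) and 詹卫东 (Zhan Weidong) as editors.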

1 F-measure F-评价
2 LR Parsing LR分析
3 ontology 本体知识库
4 labeled tree 标记树
5 SGML standard generalized mark-up language 标准通用标记语言
6 Boolean Model 布尔模型
7 Partial Parsing 部分句法分析
8 test set 测试集
9 word alignment 词对齐
10 wordnet 词网
11 pointwise mutual information 点式互信息
12 definite clause grammar, DCG 定子句语法
13 precision at cutoff 断点处的准确率
14 word sense disambiguation WSD 多义词歧义消解
15 multi-engine machine translation 多引擎机器翻译
16 bigram 二元模型
17 fertility probability 繁殖概率
18 flip-flop algorithm 反转算法
19 backward maximum matching 反向最大算法
20 nonterminal 非终结符
21 categorial grammar 范畴语法
22 finite-state cascade 分层有限状态自动机
23 classification model 分类模型
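
Entry 19 in the list above, backward maximum matching, is a dictionary-based method for segmenting Chinese text, so a short Python sketch may help. The lexicon, the maximum word length and the function name below are invented for illustration; it is only a sketch of the general technique, not the algorithm as presented in the book.

```python
def backward_maximum_matching(text, lexicon, max_len=4):
    """Segment text by scanning from the end and taking the longest
    lexicon word that ends at the current position."""
    words = []
    end = len(text)
    while end > 0:
        for length in range(min(max_len, end), 0, -1):
            candidate = text[end - length:end]
            # A single character is always accepted as a fallback
            if length == 1 or candidate in lexicon:
                words.append(candidate)
                end -= length
                break
    return words[::-1]  # words were collected from right to left


# Toy lexicon, assumed for this example only
lexicon = {"研究", "研究生", "生命", "起源"}
print(backward_maximum_matching("研究生命起源", lexicon))
# ['研究', '生命', '起源']
```

With this toy lexicon, a forward scan would first take 研究生 and leave 命 stranded, which is the kind of case where matching from the end of the sentence gives the better result.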

[This post was edited by the author on 2005-10-04 at 13:40:58]
Re: Joint E-C Bilingual Corpus Linguistics glossary

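Quoting xujiajin (2005-9-17 0:08:33):
Today frankliang told me that some people translate ngram (N-gram) as n元 (N元). Sure enough, when I got home this evening I found that 黄昌宁 (Huang Changning) and 李涓子 (Li Juanzi) use exactly N元 in their book 《语料库语言学》 (Corpus Linguistics).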

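Translating n-gram as n元语法 ("n-gram grammar") now looks like a misunderstanding, because gram here does not mean grammar (语法) but a lexical(ized) unit.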
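
To illustrate the point in the reply above, an n-gram is just a run of n consecutive units (words, in this case), with no grammar involved. A minimal Python sketch follows; the function name and the sample sentence are my own.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "a b c d".split()
print(ngrams(tokens, 2))  # [('a', 'b'), ('b', 'c'), ('c', 'd')]

# Counting repeated n-grams gives a simple list of clusters (词丛/词簇)
sample = "the corpus is a corpus of the corpus".split()
print(Counter(ngrams(sample, 2)).most_common(1))  # [(('the', 'corpus'), 2)]
```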
overuse 超用, 使用过度,过度使用
underuse 少用,使用不足,过少使用
non-native feature: ? My translation: 非本土特征
parsing, to parse, and parser, by Dr. xujiajin
parsing 句法分析
n. /'pa:zing/: syntactic analysis

to parse 句法分析
vt/vi. /'pa:z/: to perform syntactic analysis on linguistic forms/structures

parser 句法分析器
n. /'pa:zE/: a mechanism or tool that does the syntactic analysis

Parsed data refer to texts that have been syntactically annotated.
"Tagged" here means POS-tagged, i.e. part-of-speech tagged.

But tagging is used by some as a generic term for linguistic annotations at all levels.