Mark-up versus Annotation

xujiajin

Administrator
Staff member
#1
Mark-up (or markup) versus Annotation
Cited and adapted from http://lingo.lancs.ac.uk/devotedto/corpora/help.htm
David Lee's Bookmarks for Corpus-based Linguists

Mark-up = tags (added character strings) used to code the structural or surface format/renditional attributes of a text (e.g., headings, sections, page breaks, sentences, bold/italics, speaker ID, speaker turns, pauses), OR non-interpreted aspects of the situated context of the discourse (e.g. bibliographical or demographic details about the author or speaker, location of speech event, genre, etc., and also gestures, laughter, voice quality, and events such as "writes on blackboard"). In HTML/SGML/XML (mark-up languages), mark-up is always within angled brackets.
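In IT terminology, mark-up is usually translated into Chinese as "标记" (since the translations of these terms are inconsistent, I think we had better just use the English terms directly). Judging from the definition above, mark-up mainly refers to basic information that accompanies the corpus text itself: for example, paragraphing that comes from how the text was written and edited, social attributes of the language users such as identity and sex, and, for spoken data, situational factors and paralinguistic features. This information is usually recorded in the corpus header, or marked at the relevant point in the text inside angle brackets, as in the example below.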

CLEC example:
<ST 2> <SEX ?><Y ?> <SCH GDWYWMDXFSWYXX> <AGE ?> <WAY ?><DIC ?> <TYP 2>
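Because header mark-up of this kind is regular, it is easy to read by machine. The following is only a minimal sketch in Python, assuming each field has the shape <NAME VALUE> as in the line above; the function name parse_header is made up for illustration, and the meaning of the individual field codes is not interpreted here.

```python
import re

# CLEC-style header line, copied from the example above.
header = "<ST 2> <SEX ?><Y ?> <SCH GDWYWMDXFSWYXX> <AGE ?> <WAY ?><DIC ?> <TYP 2>"


def parse_header(line):
    """Return a dict mapping each header field name to its raw value string.

    Assumes every field is written as <NAME VALUE>. Values such as "?"
    are kept exactly as they appear in the header.
    """
    return dict(re.findall(r"<(\w+)\s+([^>]*)>", line))


fields = parse_header(header)
print(fields["SCH"])
```

With the fields in a dict, texts can then be filtered or grouped by any header attribute before the text body is touched.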
 

xujiajin

Administrator
Staff member
#2
Annotation = a subset of mark-up; tags (added character strings) used to code 'value-added' or interpreted information, derived through analysis by humans or machines; usually added for research purposes. The most common annotations are part-of-speech (POS) tags, lemmas, semantic tags, discourse-level/pragmatic tags.
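Annotation (also translated into Chinese as "标记" or "赋码", i.e. coding), by contrast, mainly refers to the linguistic interpretation that an annotator applies to the corpus: for example, labels that record a judgment about a word's part of speech, its semantic features, or its discourse/pragmatic features, or an analysis of phonological and prosodic features.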
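Judging from Leech's published articles, he has always been fond of the number seven, and he likewise divides annotation into seven levels:
orthographic annotation, the basic transcription level (for this notion, see Leech (1993; 1997)); phonetic/phonemic annotation; prosodic annotation; part-of-speech (POS) annotation (what is usually called "tagging" refers specifically to this level); syntactic annotation (what is usually called "parsing" refers specifically to this level); semantic annotation; and pragmatic/discourse annotation.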
BROWN
|SA01:1 the_AT Fulton_NP County_NN Grand_JJ Jury_NN said_VBD Friday_NR an_AT investigation_NN of_IN Atlanta's_NP$ recent_JJ primary_NN election_NN produced_VBD no_AT evidence_NN that_CS any_DTI irregularities_NNS took_VBD place_NN ._.
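To show why this word_TAG format is convenient for machines, here is a small Python sketch that separates each word from its POS tag. It assumes the tag follows the last underscore of a token, and that a token without an underscore (such as the leading |SA01:1 reference) is not a word; the function name split_tagged is only illustrative.

```python
# Brown-style tagged line, copied from the example above.
line = (
    "|SA01:1 the_AT Fulton_NP County_NN Grand_JJ Jury_NN said_VBD Friday_NR "
    "an_AT investigation_NN of_IN Atlanta's_NP$ recent_JJ primary_NN "
    "election_NN produced_VBD no_AT evidence_NN that_CS any_DTI "
    "irregularities_NNS took_VBD place_NN ._."
)


def split_tagged(text):
    """Split a word_TAG line into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        if "_" not in token:
            continue  # not a word_TAG token, e.g. the "|SA01:1" reference
        # Split on the LAST underscore, in case the word itself contains one.
        word, tag = token.rsplit("_", 1)
        pairs.append((word, tag))
    return pairs


pairs = split_tagged(line)
nouns = [word for word, tag in pairs if tag.startswith("NN")]
print(nouns)
```

Once the pairs are separated, queries such as "all tokens whose tag starts with NN" become a one-line filter, which is the practical payoff of POS annotation.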

CLEC
He was right. The headmaster who were [vp3, 4-] so angry made him give a talk to the whole school about his experiences abroad. [sn8, s]Actually. [sn9, s] The [fm3, 1-] children were very pround [fm1,-] of their little hero. He Became [fm3, 1-] the admiration of everyone.
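The bracketed codes in this CLEC sample can be pulled out in the same spirit. The sketch below is in Python and assumes every error tag has the shape [code, field], with a letters-plus-digits code such as fm3; it does not try to interpret the second field, and the names error_tags and strip_tags are made up for illustration.

```python
import re
from collections import Counter

# CLEC sample, copied from the example above (learner errors left untouched).
sample = (
    "He was right. The headmaster who were [vp3, 4-] so angry made him "
    "give a talk to the whole school about his experiences abroad. "
    "[sn8, s]Actually. [sn9, s] The [fm3, 1-] children were very pround "
    "[fm1,-] of their little hero. He Became [fm3, 1-] the admiration of everyone."
)

# Assumed tag shape: "[" code "," optional space, second field, "]".
ERROR_TAG = re.compile(r"\[([a-z]+\d+),\s*([^\]]*)\]")


def error_tags(text):
    """Return (code, second_field) pairs in order of appearance."""
    return ERROR_TAG.findall(text)


def strip_tags(text):
    """Remove the error tags and collapse the leftover whitespace."""
    return re.sub(r"\s+", " ", ERROR_TAG.sub(" ", text)).strip()


tags = error_tags(sample)
counts = Counter(code for code, _ in tags)
print(counts.most_common())
print(strip_tags(sample))
```

Counting the codes gives a quick error profile for a text, while stripping them recovers the learner's original wording.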

The <ART(def)> Italian <ADJ(ge)> peoples <N(com,plu)> were <AUX(semi,past):1/3> bound <AUX(semi,past):2/3> to <AUX(semi,past):3/3> fight <V(intr,infin)> in <PREP(ge)> Rome <N(prop,sing)> 's <GENM> wars <N(com,plu)> at <PREP(ge)> their <PRON(poss,plu)> own <ADJ(ge)> charge <N(com,sing)>
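The third sample uses yet another layout: each word is followed by a tag in angle brackets, with optional features in parentheses and an optional :i/n suffix. A hedged Python sketch for reading it follows; treating :i/n as "part i of an n-word unit" is my own reading of the sample, not something the source states, and parse_tokens is an illustrative name.

```python
import re

# Third sample, copied from the example above.
text = (
    "The <ART(def)> Italian <ADJ(ge)> peoples <N(com,plu)> "
    "were <AUX(semi,past):1/3> bound <AUX(semi,past):2/3> to <AUX(semi,past):3/3> "
    "fight <V(intr,infin)> in <PREP(ge)> Rome <N(prop,sing)> 's <GENM> "
    "wars <N(com,plu)> at <PREP(ge)> their <PRON(poss,plu)> own <ADJ(ge)> "
    "charge <N(com,sing)>"
)

# word, TAG, optional (features), optional :i/n suffix
TOKEN = re.compile(r"(\S+)\s+<([A-Z]+)(?:\(([^)]*)\))?(?::(\d+)/(\d+))?>")


def parse_tokens(s):
    """Return (word, tag, features, part) tuples.

    features is a list of strings (empty if the tag has none);
    part is (i, n) when the tag carries an :i/n suffix, else None.
    """
    parsed = []
    for word, tag, feats, i, n in TOKEN.findall(s):
        features = feats.split(",") if feats else []
        part = (int(i), int(n)) if i else None
        parsed.append((word, tag, features, part))
    return parsed


parsed = parse_tokens(text)
for item in parsed[:4]:
    print(item)
```

Keeping the features as a list makes it straightforward to ask, for instance, for all plural common nouns in a text.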

Phonological annotation

Marked-up/annotated texts are designed for computational tractability, and not meant to be read “raw”.

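Finally, a word in passing about the term "tag". My first association with "tag" is a price tag: the label attached to goods that shows the product name, place of origin, grade, price and so on. The concept was later taken up by IT people in computing (for example, in computing the tag ^p or </p> marks the end of a paragraph). Later still it was carried over into corpus annotation.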

Please take a look and point out anything I have covered inadequately or gotten wrong.

 

xujiajin

Administrator
Staff member
#3
Phonological annotation with Praat

[Figure: screenshot of a multi-tier phonological annotation in Praat]

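Notes:
In tier 1 of the figure, C stands for consonant and V for vowel.
In tier 4, sil denotes silence.
In tier 6, %H and L follow the way the American scholar Pierrehumbert and colleagues segment and define prosodic units.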
 