Discourse Connectives in the Chinese Treebank


Annotating Discourse Connectives in the Chinese Treebank

Nianwen Xue
Department of Computer and Information Science
University of Pennsylvania


In this paper we examine the issues that arise in annotating discourse connectives for the Chinese Discourse Treebank Project. This project is based on the same principles as the Penn Discourse Treebank (PDTB), a project that annotates the English discourse connectives in the Penn Treebank.
The paper begins by outlining the range of discourse connectives under consideration in this project and examining the distribution of the explicit discourse connectives. We then examine the types of syntactic units that can be arguments to the discourse connectives.
We show that one of the most challenging issues in this type of discourse annotation is determining the textual spans of the arguments, and that this is partly due to the hierarchical nature of discourse relations.
Finally, we discuss two further issues: sense discrimination of the discourse connectives, which involves separating discourse connective senses from non-discourse connective senses and teasing apart the different discourse connective senses; and discourse connective variation, the use of different connectives to represent the same discourse relation.


A similar paper
Enhancement of a Chinese Discourse Marker Tagger

Benjamin K. Tsou, Tom B. Y. Lai, Samuel W. K. Chan, Weijun Gao, Xuegang Zhan
{rlbtsou, cttomlai}@uxmail.cityu.edu.hk, swkchan@cs.cityu.edu.hk,
wjgao@mail.neu.edu.cn, zxg@ics.cs.neu.edu.cn

Discourse markers are complex discontinuous linguistic expressions which
are used to explicitly signal the discourse structure of a text. This paper describes
efforts to improve an automatic tagging system which identifies and classifies
discourse markers in Chinese texts by applying machine learning (ML) to the
disambiguation of discourse markers, as an integral part of automatic text summarization via rhetorical structure. Encouraging results
are reported.
Keywords: discourse marker, Chinese corpus, rhetorical relation, automatic tagging,
machine learning