Grammar and Discourse: POS tags in written and spoken Chinese

Re: POS tags in written and spoken Chinese

Quoting xiaoz (2005-11-11 0:27:40):
In reply to 14:
Are you sure such usages do not exist in written Chinese (ICTCLAS tags 嗯 as e)?

Yes, this is exactly the problem with ICTCLAS. Not all cases of 嗯 can be categorized like the English filled pauses uh, um, etc. Chinese 嗯 has some ten types of discourse functions. As such, its grammatical identity can be very hard to determine.
 
Re: POS tags in written and spoken Chinese

I would suggest that we do not conflate POS tagging and discourse tagging. The latter can only be reliably annotated by hand. In English, the discourse functions of "oops" include, e.g., expressing mild apology, shock, or dismay. But for POS tagging purposes, it is tagged as an interjection (ITJ in the BNC).

Quoting xiaoz (2005-11-11 0:27:40):
In reply to 14:
Are you sure such usages do not exist in written Chinese (ICTCLAS tags 嗯 as e)?

Yes, this is exactly the problem with ICTCLAS. Not all cases of 嗯 can be categorized like the English filled pauses uh, um, etc. Chinese 嗯 has some ten types of discourse functions. As such, its grammatical identity can be very hard to determine.
 
Re: POS tags in written and spoken Chinese

Very likely. ICTCLAS often leaves some uncommon words untagged (which I picked up and tagged by hand in my corpora). But this happens to spoken as well as written data.
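For anyone who wants to locate such untagged words automatically before fixing them by hand, here is a minimal Python sketch. It assumes the tagger output is plain text with space-separated word/tag tokens; that format and the function name find_untagged are assumptions made for illustration, not something confirmed in this thread.

```python
def find_untagged(tagged_line):
    """Return (position, token) pairs for tokens that carry no POS tag.

    Assumes a hypothetical 'word/tag' format with space-separated tokens.
    """
    untagged = []
    for position, token in enumerate(tagged_line.split()):
        # Split on the LAST slash so a slash inside the word is tolerated.
        word, sep, tag = token.rpartition("/")
        if not sep or not tag:
            # No slash at all, or a slash followed by an empty tag.
            untagged.append((position, token))
    return untagged


if __name__ == "__main__":
    sample = "我/r 嗯 知道/v 喽/"
    for position, token in find_untagged(sample):
        print(position, token)
```

The list it returns is only a worklist for manual tagging; deciding which tag each word should get still has to be done by a person.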

Quoting xujiajin (2005-11-11 0:39:23):
Quoting xiaoz (2005-11-11 0:32:32):
Isn't disfluency mirrored by repetitions, omissions, pauses etc? Such features can be marked up but are NOT POS categories. They can affect the accuracy of taggers designed for mostly 'correct' and fluent language data. That's why I said there is a need for retraining for such data.

Are there any words in spoken data which cannot be POS-tagged? If so, they become outlaws of the natural language. Why are they discriminated against and expelled from grammatical analysis?

Then is there any good way of tagging the bold items in the square brackets when machines finally fail us?
1. 算错了,[]加上[就是]喽
2. 他[]很厉害
3. [][1.0]我就不知道了
4. [呵呵],咦,谁要非要给你加
5. 那就142唉,[]高噢…… (this [好] is an adverb)
6. 你要再[],你要the嗯阿,那个的话,你就你就[] ([闪] is a new colloquial word)
7. 小玲:70岁,打打麻将,[]?
小峰:[]
小玲:80岁,晒晒太阳
小峰:[]
小玲:90岁,躺在床上,一百岁,挂在墙上
小峰:[]
小玲:[]
小峰:(哼小曲)
8.
小玲:真的啊?
小峰:啊,放心,难不倒我,[真是],随便划划,刚唱(听不清楚)好好做吧
小玲:嘻嘻
 
Re: POS tags in written and spoken Chinese

I was using "correct" for learner data. Spoken and learner data types are difficult for taggers.

Quoting xujiajin (2005-11-11 0:36:15):
Disfluency is not "wrong" in the sense of natural language.
 
Re: POS tags in written and spoken Chinese

Quoting xiaoz (2005-11-11 0:32:32):
Isn't disfluency mirrored by repetitions, omissions, pauses etc? Such features can be marked up but are NOT POS categories. They can affect the accuracy of taggers designed for mostly 'correct' and fluent language data. That's why I said there is a need for retraining for such data.

Disfluencies (including repetitions, omissions, pauses and many other slips of the tongue) are ill-formed in syntactic terms, but they are psychologically real.
 
Re: POS tags in written and spoken Chinese

Quoting xiaoz (2005-11-11 1:04:11):
I was using "correct" for learner data. Spoken and learner data types are difficult for taggers.

Yes, Chomsky also finds natural speech difficult to discuss, so he never gives even a glance at sloppy linguistic performance. With corpus data, however, we cannot turn a blind eye to it.
 
In reply to 24.

No good idea. The only option is to manually tag the words the machine fails on, as I did in LCMC.

Also, if you want to differentiate between different discourse functions, I would suggest developing an annotation scheme and searching for and tagging relevant items by hand in a text editor. I tagged all aspect markers in LCMC using my own scheme instead of the ICTCLAS tags.
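To make the "search and tag by hand" step a little less tedious, a small keyword-in-context listing can collect every occurrence of a candidate item with some surrounding text, so the annotator only has to decide on a label. The sketch below is a hedged illustration in Python: the function name kwic, the width parameter, and the assumption that the corpus is a list of unsegmented text lines are all mine, not part of LCMC or ICTCLAS.

```python
def kwic(lines, keyword, width=10):
    """List every occurrence of keyword with left and right context.

    Returns (line_number, left_context, keyword, right_context) tuples.
    Context is measured in characters, which suits unsegmented Chinese text.
    """
    hits = []
    for line_no, line in enumerate(lines, start=1):
        start = line.find(keyword)
        while start != -1:
            end = start + len(keyword)
            left = line[max(0, start - width):start]
            right = line[end:end + width]
            hits.append((line_no, left, keyword, right))
            # Continue searching after this occurrence.
            start = line.find(keyword, end)
    return hits


if __name__ == "__main__":
    corpus = ["都是什么乱七八糟的。", "谈什么稳定"]
    for line_no, left, key, right in kwic(corpus, "什么", width=3):
        print(line_no, left, "[" + key + "]", right)
```

The output is only a candidate list. Which discourse function each hit has is exactly the judgment that has to be made by hand, following whatever annotation scheme has been developed.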
 
In reply to 26 -
Agree that these are natural phenomena in speech. Another spoken feature that is hard to process automatically is truncation. A truncated word in Chinese can become another word or a word-forming morpheme, while a truncated word in English can become another word or a non-word. I recall that in ICE-GB the parser ignores such disfluency for parsing purposes, but for the study of spoken language, all such features are important.
 
In reply to 27 -

Yes, agreed. We cannot ignore "inconvenient data" in corpora. In the case of ICE-GB, pragmatism takes over by ignoring such data in parsing. It must be accepted that whether to annotate a corpus, and what types of annotation to include, are determined by the research questions the corpus is intended to address. Such decisions are also affected by existing technologies. A balance must be struck between perfection and pragmatism.

A little endnote - I am not like Chomsky, who turns a blind eye to performance data.

The debate over whether POS categories apply to both written and spoken registers reminds me of the "differentness" vs. "sameness" approaches to the study of English grammar. The "differentness" approach is taken by the Nottingham School, while the "sameness" approach is taken by Biber et al. (1999) - see another post in the forum. Still, I think POS categories are different from grammatical categories: the former deal with words/tokens, whereas the latter deal with grammatical structures. Grammatical structures can be vastly different in writing and speech, but words are the same in both.
 
I just came across another example. How should 什么 be POS-tagged here?

都是[什么]乱七八糟的。
 
Similar uses are in fact not rare in written Chinese. If distinctions between discourse functions are to be made, they can be made in both written and spoken registers.

闯进 来 的 人 一 脸 凶恶 , 你 也 不 看看 这 是 什么 地方 !
听 小伙子 这么 一 说 , 红杏 大爷 才 联想 起 这 几 天 听 她 闺女 晚上 回家 唠叨 的 , 什么 抢购 风 什么 的 。
画 个 其他 什么 不伦不类 的 图形 呢 ?
再说 , 彩电 价格 那么 高 , 老百姓 买不起 骂娘 , 谈 什么 稳定 ?
倘若 这 这样 的 两 个 具体 问题 都 解决 不 了 , 还 何 谈 什么 真抓实干 ?
 
Some naive questions:

Again words like 什么 here cannot be pigeonholed grammatically, can they?

In other words, can they be deleted with the sentence still making good sense?

Or can we say words in written and spoken language are either grammatical or discoursal?
 
Re: Grammar and Discourse: POS tags in written and spoken Chinese

In my view, grammar and discourse are two separate perspectives on linguistic analysis. Grammatical functions and discourse functions must not be conflated. POS annotation is grammatical whereas discourse annotation is discoursal, though both types of analysis can be undertaken in the same corpus.

Quoting xujiajin (2005-11-11 21:41:51):
Some naive questions:

Again words like 什么 here cannot be pigeonholed grammatically, can they?

In other words, can they be deleted with the sentence still making good sense?

Or can we say words in written and spoken language are either grammatical or discoursal?
 
Re: Grammar and Discourse: POS tags in written and spoken Chinese

My 2 cents:

-Theoretically there should be separate taggers for written and spoken language; in
reality, however, it is very difficult, if not impossible, to come up with these taggers.
The main problems are that 1) there is far less research on spoken discourse compared
to written discourse, and 2) spoken and written are relative anyway.

-Currently if we use a tagger that is based on the written language, the main problem
to me is that many multi-word expressions that function as a single word will be dismantled
because they are not common in the written language. Such examples may include
就是说, 真是的, 那什么, 那谁 - a tagger can easily split them into multiple words while in
reality they function as single words.
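One possible workaround for the splitting problem is a post-processing pass that glues the pieces back together whenever a run of adjacent tokens matches a hand-built list of spoken multi-word units. The Python sketch below illustrates this with greedy longest matching; the function name merge_multiword_units, the max_len parameter, and the lexicon contents (taken from the examples above) are assumptions for illustration and not a feature of any existing tagger.

```python
def merge_multiword_units(tokens, lexicon, max_len=4):
    """Rejoin adjacent tokens that together form a known multi-word unit.

    tokens  -- list of strings from a segmenter (words or single characters)
    lexicon -- set of multi-word units that should stay in one piece
    max_len -- maximum number of adjacent tokens to try joining
    """
    merged = []
    i = 0
    while i < len(tokens):
        # Try the longest span first, down to a span of two tokens.
        for span in range(min(max_len, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in lexicon:
                merged.append(candidate)
                i += span
                break
        else:
            # No span matched: keep the token unchanged.
            merged.append(tokens[i])
            i += 1
    return merged


if __name__ == "__main__":
    units = {"就是说", "真是的", "那什么", "那谁"}
    print(merge_multiword_units(["他", "就是", "说", "那", "谁", "来", "了"], units))
```

Because the function only joins strings, it works the same way when the input tokens are single characters. The merged units still need a POS tag of their own, which this sketch does not attempt to assign.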
 
Re: Grammar and Discourse: POS tags in written and spoken Chinese

Quoting 动态语法 (2005-11-15 0:35:14):
-Currently if we use a tagger that is based on the written language, the main problem
to me is that many multi-word expressions that function as a single word will be dismantled because they are not common in the written language. Such examples may include 就是说, 真是的, 那什么, 那谁 - a tagger can easily split them into multiple words while in reality they function as single words.

Agreed. That is why I prefer character-based tokenization for spoken corpora, with multi-word units treated as Chinese words or phrases.
 