Grammar and Discourse: POS tags in written and spoken Chinese

How can you be sure that a POS tagset informed by written Chinese can be applied to LLSCC?

I don't think ICTCLAS can be trusted for the POS tagging results without any careful hand editing.
 
I agree with Dr Xu. Part of speech is still unsettled both in theory and in practice. It does need careful hand editing.
 
I guess Richard used freeICTCLAS.
http://www.corpus4u.com/forum_view.asp?view_id=557&forum_id=8
ICTCLAS, the Chinese lexical analysis system of the Institute of Computing Technology, Chinese Academy of Sciences
 
Re: POS tags in written and spoken Chinese

Tagsets can of course vary from tagger to tagger, or even for the same tagger (e.g. BNC C1-8 tagsets for CLAWS). I think a tagset for a particular language can apply to both written and spoken registers, but a tagger trained on written data must be adjusted to tag spoken data. For example, CLAWS was adjusted when the spoken BNC was tagged (but using the same tagset). A tagger may also need adjusting when learner data is tagged.

ICTCLAS achieved an accuracy rate of over 97% for written general Chinese, particularly news texts with which it was trained. But for spoken Chinese, my experiments showed an accuracy rate of 85-95%, varying across spoken genres. In the frequencies posted, the written corpus was hand checked but the spoken corpus was not.
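
For readers who want to reproduce this kind of figure, here is a minimal sketch of how tagging accuracy can be computed against a hand-checked version. It assumes both versions are stored as space-separated `word/tag` tokens with identical tokenisation; the function names and the sample lines are made up for illustration and are not taken from the corpora discussed here.

```python
def parse_tagged(line):
    """Split a line of space-separated word/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        word, sep, tag = token.rpartition("/")
        if not sep:  # token carries no tag at all
            word, tag = token, ""
        pairs.append((word, tag))
    return pairs


def tagging_accuracy(auto_lines, gold_lines):
    """Return (correct, total) over two parallel lists of tagged lines."""
    correct = total = 0
    for auto_line, gold_line in zip(auto_lines, gold_lines):
        auto, gold = parse_tagged(auto_line), parse_tagged(gold_line)
        if [w for w, _ in auto] != [w for w, _ in gold]:
            raise ValueError("tokenisation differs; align the texts first")
        correct += sum(a[1] == g[1] for a, g in zip(auto, gold))
        total += len(gold)
    return correct, total


# Hypothetical sample: automatic output versus a hand-checked version.
auto_sample = ["我/r 就/d 不/d 知道/v 了/y", "嗯/n 好/a"]
gold_sample = ["我/r 就/d 不/d 知道/v 了/y", "嗯/e 好/a"]

correct, total = tagging_accuracy(auto_sample, gold_sample)
accuracy = correct / total
print(f"{correct}/{total} tags correct ({accuracy:.1%})")
```

Running the same function separately on each subcorpus would show how the rate varies across spoken genres.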

Quoting xujiajin, 2005-11-10 12:11:38:
How can you be sure that a POS tagset informed by written Chinese can be applied to LLSCC?

I don't think ICTCLAS can be trusted for the POS tagging results without any careful hand editing.
 
Re: POS tags in written and spoken Chinese

Quoting xiaoz, 2005-11-10 21:54:43:
But I think a tagset for a particular language can apply to both written and spoken registers, but ...
----Cannot agree.


ICTCLAS achieved an accuracy rate of over 97% for written general Chinese, particularly news texts with which it was trained. But for spoken Chinese, my experiments showed an accuracy rate of 85-95%, varying across spoken genres.

-----Did you run a benchmark test? As I see it, the lowest accuracy rate of 85 percent is a long way from 97 percent in a one-million-word corpus.
 
Re: POS tags in written and spoken Chinese

Well, the view that "Part of speech is still unsettled both in theory and in practice" is an overstatement. For many languages including Chinese, POS tagging has met with great success, with a typical error rate of 3% for general written language. The automatically POS tagged data is sufficiently reliable for many applications.

Of course, as wzli said in another post, raw texts/transcripts are useful. The POS tags in a corpus can be easily removed if you prefer a plain text corpus. But for Chinese, it appears that most corpus exploration tools require tokenised data, which is not plain at all, because tokenisation typically goes through a process similar to POS tagging (the former is the basis of the latter). That means that unless you totally reject Chinese corpus data, some processing which is "unsettled both in theory and in practice" is inevitable.
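
To make the point about tag removal concrete, here is a small sketch, assuming the tagged corpus uses space-separated `word/tag` tokens (the helper name and the sample line are hypothetical). Stripping the tags is trivial, but what remains is still tokenised text; only joining the tokens back together gives something like the raw text.

```python
def strip_tags(tagged_line):
    """Drop the /tag suffix from each token, keeping the word segmentation."""
    words = []
    for token in tagged_line.split():
        word, sep, _tag = token.rpartition("/")
        words.append(word if sep else token)  # untagged tokens pass through
    return words


# Hypothetical tagged line for illustration.
tagged = "我/r 就/d 不/d 知道/v 了/y"

tokens = strip_tags(tagged)
tokenised_text = " ".join(tokens)  # tags gone, segmentation still present
plain_text = "".join(tokens)       # segmentation removed as well

print(tokenised_text)
print(plain_text)
```

The segmentation decisions survive in `tokenised_text` even though no tag is visible, which is why tokenised data cannot really be called plain.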

Quoting yinghuang, 2005-11-10 12:31:52:
I agree with Dr Xu. Part of speech is still unsettled both in theory and in practice. It does need careful hand editing.
 
Re: POS tags in written and spoken Chinese

Yes, that's right.

Quoting xujiajin, 2005-11-10 21:40:48:
I guess Richard used freeICTCLAS.
http://www.corpus4u.com/forum_view.asp?view_id=557&forum_id=8
ICTCLAS, the Chinese lexical analysis system of the Institute of Computing Technology, Chinese Academy of Sciences
 
Re: POS tags in written and spoken Chinese

1) I assume the POS categories are the same for the same language, but the algorithms/rules of a tagger trained on written data must be retrained/rewritten for speech. Can you give some examples of POS categories that exist only in writing but not in speech or vice versa?

2) The significantly lower accuracy rate (85%) of ICTCLAS applies to one spoken genre in my corpus, namely the Callhome Mandarin component. It arises because the LDC had already marked up this part for proper nouns and many spoken features, which I preferred to retain in the POS tagged version. While this preprocessing on the part of the LDC seriously affected the accuracy rate of POS tagging, it is useful for the study of spoken Chinese. For the other subcorpora in LLSCC, the tagging accuracy is very close to that for the written language.

Quoting xujiajin, 2005-11-10 22:04:33:
Quoting xiaoz, 2005-11-10 21:54:43:
But I think a tagset for a particular language can apply to both written and spoken registers, but ...
----Cannot agree.

ICTCLAS achieved an accuracy rate of over 97% for written general Chinese, particularly news texts with which it was trained. But for spoken Chinese, my experiments showed an accuracy rate of 85-95%, varying across spoken genres.

-----Did you run a benchmark test? As I see it, the lowest accuracy rate of 85 percent is a long way from 97 percent in a one-million-word corpus.
 
Your discussion here is very informative. Can Richard please explain a bit more about how to map the ICTCLAS tagset (for Chinese text) to the CLAWS tagset (for English text)?
 
Re: POS tags in written and spoken Chinese

Well, there may not be direct correspondences between tagsets for different languages. While some POS categories are shared by English and Chinese, others are not (e.g. articles in English and particles, 助词, in Chinese). I mentioned the CLAWS tagger and the different versions of its tagsets to show that different tagsets can be applied to one language, even with the same tagger. But within one language, a well designed tagset can apply to both written and spoken registers.

Quoting laohong, 2005-11-10 23:50:54:
Your discussion here is very informative. Can Richard please explain a bit more about how to map the ICTCLAS tagset (for Chinese text) to the CLAWS tagset (for English text)?
 
Re: POS tags in written and spoken Chinese

Quoting xiaoz, 2005-11-10 22:25:58:
Can you give some examples of POS categories that exist only in writing but not in speech or vice versa?

I picked a few examples at random from my SCOUT; I am not sure whether they help to make the point.
1. 算错了,[]加上[就是]喽
2. 他[]很厉害
3. [][1.0]我就不知道了
4. [呵呵],咦,谁要非要给你加
5. 那就142唉,[]高噢…… (here [好] is an adverb)
6. 你要再[],你要the嗯阿,那个的话,你就你就[] ([闪] is a new colloquial word)
7. 小玲:70岁,打打麻将,[]?
小峰:[]
小玲:80岁,晒晒太阳
小峰:[]
小玲:90岁,躺在床上,一百岁,挂在墙上
小峰:[]
小玲:[]
小峰:(hums a tune)
8.
小玲:真的啊?
小峰:啊,放心,难不倒我,[真是],随便划划,刚唱(inaudible)好好做吧
小玲:嘻嘻
 
An example that best illustrates the need to retrain the tagging algorithm and rewrite the tagging rules is mm in English, which can be a noun (a unit of measurement) or an interjection. In the public release of the BNC, which was tagged using a version of CLAWS retrained for spoken English, only 10 instances of mm were tagged as a noun in the four-million-word demographically sampled component of the corpus; in the same part of the corpus tagged using the standard version of CLAWS, 2271 instances were tagged as a noun.
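
A comparison like this can be run by counting the tags assigned to one word form in two tagged versions of the same text. The sketch below assumes space-separated `word/tag` tokens and uses placeholder tag names (`NOUN`, `ITJ`, `NUM`) and invented sample lines; it does not reproduce the actual BNC format, tagset or counts.

```python
from collections import Counter


def tag_counts(tagged_lines, word):
    """Count the tags assigned to one word form (case-insensitive)."""
    counts = Counter()
    for line in tagged_lines:
        for token in line.split():
            form, sep, tag = token.rpartition("/")
            if sep and form.lower() == word:
                counts[tag] += 1
    return counts


# Invented output of a tagger trained on written data only.
standard_version = ["mm/NOUN yes/ITJ", "10/NUM mm/NOUN", "mm/NOUN"]
# Invented output of the same tagger after retraining for speech.
retrained_version = ["mm/ITJ yes/ITJ", "10/NUM mm/NOUN", "mm/ITJ"]

standard = tag_counts(standard_version, "mm")
retrained = tag_counts(retrained_version, "mm")

print("standard: ", dict(standard))
print("retrained:", dict(retrained))
```

The tagset is identical in both versions; only the distribution of tags over the same word form changes.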
 
In reply to 14:
Are you sure such usages do not exist in written Chinese (ICTCLAS tags 嗯 as e)?
 
Re: POS tags in written and spoken Chinese

Isn't disfluency mirrored by repetitions, omissions, pauses etc? Such features can be marked up but are NOT POS categories. They can affect the accuracy of tagging designed for mostly 'correct' and fluent language data. That's why I said there is a need for retraining for such data.

Quoting xujiajin, 2005-11-11 0:25:04:
The so-called disfluency abounds in natural speech.
 
Re: POS tags in written and spoken Chinese

Quoting xiaoz, 2005-11-11 0:32:32:
Isn't disfluency mirrored by repetitions, omissions, pauses etc? Such features can be marked up but are NOT POS categories. They can affect the accuracy of tagging designed for mostly 'correct' and fluent language data. That's why I said there is a need for retraining for such data.

Are there any words in spoken data which cannot be POS-tagged? If so, they become outlaws of the natural language. Why are they discriminated against and expelled from grammatical analysis?
 