How to build a word index with Xaira for an annotated Chinese corpus

xiaoz · 2006-02-12

xudekuan wrote at 2006-1-26 15:34:11 in the News section -

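My question is as in the title. I have tried many times, and every time I get a character index. For a Chinese corpus that has been word-segmented and POS-tagged, how do I get a word index?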

Thanks!

I look forward to detailed guidance.
Something like this:
Node Frequency Z-score
内参 3 46.4
匿名 1 40.9
新教徒 1 40.9
迪 1 40.9
切肤之痛 1 40.9
34.3% 1 40.9
你报 1 40.9
侧面 3 34.0
高远 1 28.9
cpu 1 28.9
现实感 1 28.9
妖 1 28.9
虚幻 2 28.9
寄出 1 23.6

rather than
厂 2 6.3
幅 1 6.0
出 6 5.9
向 5 5.7

……

xiaoz · 2006-02-12

Hope the reply I gave on Xaira-Preview List is of help to you -

The Xaira Indexer follows the Unicode tokenisation rules, which by default treat every Chinese character as a token or "word". Without proper markup, the indexer ignores the white space that your segmentation tool has already inserted between tokens. You must insert a pair of opening and closing tags around each token, as in

<TOK>XXX</TOK> <TOK>YYY</TOK>

(or any XML element name you like)

if your corpus is not POS-tagged and formatted as

<w POS="n">noun</w> <w POS="v">verb</w>

To POS tag your corpus, you will need a tagger. As you already have a tokenised corpus, inserting token tags is quite straightforward with a few lines of Perl. Or, if you do not program, you can use Word or a text editor: Replace All runs of one or more white-space characters with the sequence </TOK> <TOK>, and then fix the two ends so that the text starts with <TOK> and ends with </TOK> (remove the stray </TOK> at the very beginning if the text started with white space, and add the missing </TOK> at the very end).
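
The tag insertion described above can be sketched as a short script. This is a minimal sketch in Python rather than Perl, assuming the corpus is plain text with tokens separated by white space. The function names are made up for illustration, and the "word/pos" input format handled by the second function is an assumption about what a tagger might produce, not something Xaira requires.

```python
from xml.sax.saxutils import escape, quoteattr


def wrap_tokens(line: str, tag: str = "TOK") -> str:
    """Wrap each whitespace-separated token of one line in <tag>...</tag>."""
    # str.split() with no argument splits on any run of white space and
    # drops leading/trailing white space, so no stray tags are produced.
    return " ".join(f"<{tag}>{escape(tok)}</{tag}>" for tok in line.split())


def wrap_tagged_tokens(line: str, sep: str = "/") -> str:
    """Turn 'word/pos word/pos' (assumed format) into <w POS="pos">word</w>."""
    out = []
    for item in line.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains "/" keeps everything before the final "/".
        word, found, pos = item.rpartition(sep)
        if not found or not word or not pos:
            # No usable tag on this item: emit it as an untagged token.
            out.append(f"<w>{escape(item)}</w>")
        else:
            out.append(f"<w POS={quoteattr(pos)}>{escape(word)}</w>")
    return " ".join(out)


if __name__ == "__main__":
    print(wrap_tokens("内参 侧面 虚幻"))
    print(wrap_tagged_tokens("内参/n 寄出/v"))
```

Escaping with `escape` and `quoteattr` matters because characters such as & or < in the corpus text would otherwise make the output ill-formed XML.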

A corpus processed in this way can be indexed by defining "word break" in "special tags" as TOK. When you open the indexed corpus in the client, you will get a list of words as segmented by your tool, instead of a list of characters.
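
Before indexing, it may be worth checking that the tagged text is well-formed and that the tokens are what you expect. The following is a minimal sketch using only the Python standard library. It assumes the tagged text sits inside a single root element (the `<text>` element here is hypothetical), and `token_frequencies` is a made-up helper name, not part of Xaira.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def token_frequencies(xml_text: str, tag: str = "TOK") -> Counter:
    """Parse xml_text and count the text content of every <tag> element."""
    # fromstring raises ET.ParseError if the markup is not well-formed,
    # for example when a closing tag is missing at the end of the text.
    root = ET.fromstring(xml_text)
    return Counter((el.text or "").strip() for el in root.iter(tag))


if __name__ == "__main__":
    sample = "<text><TOK>内参</TOK> <TOK>侧面</TOK> <TOK>内参</TOK></text>"
    for word, freq in token_frequencies(sample).most_common():
        print(word, freq)
```

If the counts come out per word rather than per character, the markup is in the shape the indexer needs.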

________________________________

From: Lou Burnard [mailto:lou.burnard@computing-services.oxford.ac.uk]
Sent: Mon 05/09/2005 22:41
To: xara-preview@maillist.ox.ac.uk
Subject: [xara-preview] [Fwd: [rt.oucs.ox.ac.uk #880324] [web generated] Using Xaira to Analyse Chinese Corpus] (fwd)

Here's a query that I know folks on this list can answer better than me...

k/Ticket/Display.html?id=880324 >

Web-generated message on Mon Sep 5 14:49:41 2005
Sent From: ()
Remote Ident: (using )
Script invoked from:
Submitter email: scott.grant@arts.monash.edu.au
Submitter name: Scott Grant
Submitter barcode:
____Message_follows________________________________________
Dear All,

I am a lecturer in Chinese Studies at Monash University in Melbourne Australia.
As part of a Masters level course I teach we have been looking at corpus-based
translation studies and I have introduced Xaira to my students as an example of
the type of software they can use to do some basic analysis of a DIY corpus. My
students are from a range of backgrounds and we have Chinese, Japanese and
Arabic speakers. One small technical matter that I have come across in relation
to Chinese is the fact that I can't (without marking up) make Xaira recognise
"words" in Chinese so that I can do "Word Query" and frequency analysis. While
Xaira recognises and displays the individual Chinese characters without any
problems, the problem is that "words" in Chinese are often made up of
combinations of two or more characters. I have tried using other annotation
software on the Web to segment the characters into their combinations before
using The Xaira Indexing Tool Kit to set up the Chinese corpus, but the "Word
Query" still only shows individual characters in the frequency list. Is there
some other process that I need to go through to enable "Word Query" to display
"words" in Chinese? Your advice appreciated.

Yours,

Scott
<End of message>
