Switching freely between different encodings: is it possible?

patricx · 2005-07-25

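Dr. xiaoz has said that concordancers developed abroad, such as WordSmith, expect Unicode-encoded text. But they cannot solve the problems of word segmentation and of inserting spaces between words. My own experience matches what Dr. xujiajin described: first use software designed by Chinese developers to preprocess the text, for example to segment it. However, ICTCLAS (a Chinese word segmentation and part-of-speech tagging tool) can only segment ASCII-encoded Chinese text, and its output is also ASCII-encoded. The text then has to be processed with WordSmith, which raises the problem of switching encodings. For a single text we can use "Save As", but for a large number of texts that is obviously impractical. It really is a problem. Dr. xiaoz recommended the Multilingual Corpus Tool (MLCT).

I have heard that ASCII encoding is the best choice when working on algorithms for Chinese. But can we convert freely between encodings? Or is this a case where we cannot have both?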

xiaoz · 2005-07-25

It is very funny that the Unicode tokenisation rules treat each Chinese character as a word, but tools like WST4 and Xaira require Unicode for non-ASCII scripts. Here is my practice for processing Chinese data.

1) Using segmentation tools like ICTCLAS to tokenise GB data into words (and also POS tagging if necessary);

2) Converting the processed data into the annotation style required by the concordancer (e.g. XML for Xaira, and the LOB style, token_tag, for MonoConc; WST accepts both);

3) Using encoding conversion tools like MLCT to unicodify the data. MLCT is recommended as it was written in Java, which fully supports Unicode;

4) Using appropriate concordancers to analyse the data.
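
Step 2 can be sketched in a few lines of Python. This is only an illustration, not the method used in the thread: it assumes the segmenter writes space-separated `word/tag` tokens, and the function name `slash_to_underscore` is made up for the example.

```python
def slash_to_underscore(line: str) -> str:
    """Turn 'word/tag word/tag' into the LOB-style 'word_tag word_tag'."""
    out = []
    for token in line.split():
        # Split on the LAST slash so a word that itself contains '/' survives.
        word, sep, tag = token.rpartition("/")
        # Tokens without a tag (no slash, or nothing before it) are kept as-is.
        out.append(f"{word}_{tag}" if sep and word else token)
    return " ".join(out)


# Assumed input format: one segmented, POS-tagged sentence per line.
sample = "我/r 爱/v 北京/ns 。/w"
print(slash_to_underscore(sample))
```

If the tagger uses a different separator, only the `rpartition` argument needs to change.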

patricx · 2005-07-25

Thanks, Dr. xiao, you can always find the right solution.

patricx · 2005-07-26

Re: Switching freely between different encodings: is it possible?

Quoting xiaoz, posted 2005-7-25 20:08:25:
It is very funny that the Unicode tokenisation rules treat each Chinese character as a word, but tools like WST4 and Xaira require Unicode for non-ASCII scripts. Here is my practice for processing Chinese data.

1) Using segmentation tools like ICTCLAS to tokenise GB data into words (and also POS tagging if necessary);

2) Converting the processed data into the annotation style required by the concordancer (e.g. XML for Xaira, and the LOB style, token_tag, for MonoConc; WST accepts both);

3) Using encoding conversion tools like MLCT to unicodify the data. MLCT is recommended as it was written in Java, which fully supports Unicode.

I have two questions:
1) If we use ICTCLAS to segment GB data into words (and POS-tag it if necessary), we can only process one text at a time. If our data set is large, that is not a good approach.

2) It would be better if WS4 could incorporate an encoding conversion tool like MLCT to unicodify data. The operation would become much simpler.

Of course, if Mike Scott could integrate both of the functions above (segmentation and encoding conversion), that would be the best news.

动态语法 · 2005-07-26

Re: Switching freely between different encodings: is it possible?

Quoting patricx, posted 2005-7-26 11:04:51:
I have two questions:
1) If we use ICTCLAS to segment GB data into words (and POS-tag it if necessary), we can only process one text at a time. If our data set is large, that is not a good approach.

First of all, ICTCLAS is probably as fast as you can get. What you
need to do is simply click on the file name and, if the file has a
reasonable size, the segmentation is done. That to me is pretty fast
(unless you have a very slow system).

Now, that said if you really want to do it fast, you could merge (with
TextPro?) all your text files (or some sub-sets of) into a super file and
click once and sit back and relax. But of course this is a quick and dirty
way of doing it as you will have to sacrifice your original text boundaries.
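
One way to soften that trade-off, sketched here in Python, is to write a marker line before each text when merging, and to split on the markers again afterwards. This is an assumption-laden sketch: the marker string `@@@FILE:` and both function names are hypothetical, and it only works if the segmenter passes the marker line through unchanged, which would need checking on real output.

```python
MARKER = "@@@FILE:"  # hypothetical boundary marker, not part of any tool


def merge_texts(texts: dict) -> str:
    """Join {file name: text} into one string, one marker line per text."""
    parts = []
    for name, text in texts.items():
        parts.append(f"{MARKER}{name}\n{text.rstrip()}\n")
    return "".join(parts)


def split_texts(merged: str) -> dict:
    """Recover {file name: text} from a string produced by merge_texts."""
    files = {}
    current = None
    for line in merged.splitlines():
        if line.startswith(MARKER):
            current = line[len(MARKER):].strip()
            files[current] = []
        elif current is not None:
            files[current].append(line)
    return {name: "\n".join(lines) for name, lines in files.items()}


merged = merge_texts({"a.txt": "第一 篇", "b.txt": "第二 篇\n第二 行"})
restored = split_texts(merged)
print(sorted(restored))
```

Reading the source files and writing the restored ones is left out on purpose, since the right encoding for each step depends on the tools involved.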

Quoting patricx, posted 2005-7-26 11:04:51:

2) It would be better if WS4 could incorporate an encoding conversion tool like MLCT to unicodify data. The operation would become much simpler.

Of course, if Mike Scott could integrate both of the functions above (segmentation and encoding conversion), that would be the best news.

WST4 already has the Unicode conversion functions built in.
See the screen grab below.

(By the way, you probably have noticed, WST4 can also do
batch conversion of MS DOC to Text.)

As for segmentation of Chinese text, it's probably not going
to be part of WST any time soon, realistically speaking.

[This post was edited by the author on 2005-07-26 at 12:16:14]

patricx · 2005-07-26

Great! Thanks!

patricx · 2005-07-26

Re: Switching freely between different encodings: is it possible?

Quoting 动态语法, posted 2005-7-26 11:43:58:

WST4 already has the Unicode conversion functions built in.
See the screen grab below.

Have you tested this function, 动态语法? This is the result with my texts:

xiaoz · 2005-07-26

1) Until we find a tagger/tokeniser that can process files in a batch, we have to go with tools like ICTCLAS, which process one file at a time. Indeed, we can merge files of the same type into a file of reasonable size, as 动态语法 suggested. Therefore, instead of processing the 500 samples in LCMC, I processed 15 files, one for each category.

2) WST4 can indeed convert many native encodings into Unicode. Unfortunately it does not include Simplified Chinese. Chinese [1028] actually refers to Traditional Chinese (Big-5), not GB2312 or GBK. That's why the text in the above figure is unreadable. Until Scott includes Simplified Chinese in the conversion tool (I will suggest that to him), we can use MLCT.
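
The effect in point 2 is easy to reproduce. The Python sketch below is only a demonstration of the general problem, not of what WST4 does internally: it encodes a Simplified Chinese string as GB2312 and then decodes the same bytes once with the right codec and once as Big-5.

```python
text = "语料库"

# Bytes as they would sit in a GB2312-encoded file.
raw = text.encode("gb2312")

# Decoding with the matching codec round-trips cleanly.
as_gb = raw.decode("gb2312")

# Decoding the same bytes as Big-5 yields different characters
# (or replacement marks), i.e. unreadable text.
as_big5 = raw.decode("big5", errors="replace")

print(as_gb)
print(as_big5)
```

The bytes themselves are fine in both cases. Only the choice of source encoding decides whether the Unicode result is readable.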

3) Word tokenisation for Chinese is not a trivial task. Not many languages need tokenisation as Chinese does. So I don't think WST4 should include a Chinese segmenter.

xiaoz · 2005-07-27

Good news! The Text Converter in WST4 will hopefully correct this bug and allow all languages supported on your Windows system to be converted into Unicode. See Mike Scott's reply below:

Oops -- sorry. Just checked the source code again. Indeed, the Language Chooser shows all the languages and Text Converter only shows 1 version of each language! I will correct this in the next upload, that is assuming it doesn't screw something else up in WS4!

xujiajin · 2005-07-28

Waiting for the debugged new upload of WS4.

patricx · 2005-07-29

Re: Switching freely between different encodings: is it possible?

Quoting xujiajin, posted 2005-7-28 0:27:29:
Waiting for the debugged new upload of WS4.

Cheers! WS4 has already fixed the bug, and now we can convert freely from the Chinese GB encoding into Unicode using the "Text Converter" tool.

xiaoz · 2005-07-29

May I remind you to use the safe mode, please, to protect your data.

patricx · 2005-07-29

You mean the new version of WS4 is not safe? What exactly do you mean?

xiaoz · 2005-07-29

If you choose to overwrite original files, you will only have a copy of data in Unicode. In safe mode (i.e. using a temp directory), your original data will be intact while you also get a copy of Unicode data in the temp directory.
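
The same idea can be sketched outside WS4 in Python. This is a minimal illustration under stated assumptions, not a description of the Text Converter: the function name `convert_dir` is hypothetical, the sources are assumed to be GBK/GB2312 `.txt` files, and UTF-16 with a byte order mark is assumed to be the Unicode form the concordancer wants.

```python
import tempfile
from pathlib import Path


def convert_dir(src_dir, out_dir, src_enc="gbk", dst_enc="utf-16"):
    """Write a re-encoded copy of every .txt file; never touch the sources."""
    src, out = Path(src_dir), Path(out_dir)
    if src.resolve() == out.resolve():
        raise ValueError("output directory must differ from the source")
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.glob("*.txt")):
        # Work on bytes so that line endings are carried over unchanged.
        text = path.read_bytes().decode(src_enc)
        target = out / path.name
        target.write_bytes(text.encode(dst_enc))
        written.append(target)
    return written


# Small demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "gb"
    src.mkdir()
    original = "语料库 语言学".encode("gbk")
    (src / "sample.txt").write_bytes(original)
    outputs = convert_dir(src, Path(tmp) / "unicode")
    source_untouched = (src / "sample.txt").read_bytes() == original
    converted_text = outputs[0].read_bytes().decode("utf-16")

print(source_untouched, converted_text)
```

A file that is not really in the assumed source encoding raises `UnicodeDecodeError` instead of being silently garbled, which is usually what you want for corpus data.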

patricx · 2005-07-29

That's right. Thanks for reminding me of this. It's really important. When I used TextPro to insert spaces between characters, it really damaged my original data. I have already changed to the safe mode you mentioned. Thanks again. It's very kind of you!
