Quoting patricx's post of 2005-7-26 11:04:51:
I have two questions:
1) If we use ICTCLAS to segment GB data into words (and also do POS tagging if necessary), we can process only one text at a time. If our data is large, that is not a good way.
First of all, ICTCLAS is probably as fast as you can get. What you
need to do is simply click on the file name and, if the file has a
reasonable size, the segmentation is done. That to me is pretty fast
(unless you have a very slow system).
Now, that said, if you really want to do it fast, you could merge (with
TextPro?) all your text files (or some subsets of them) into one super file,
click once, and sit back and relax. But of course this is a quick and dirty
way of doing it, as you will have to sacrifice your original text boundaries.
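For anyone who would rather script the merge than do it by hand, here is
a minimal sketch in Python. It is not something ICTCLAS or TextPro
provides. It assumes the files are GB-encoded (read here as gb18030),
and the marker line and the name merge_texts are made up for
illustration. The marker is only an attempt to keep track of where each
text started; whether the segmenter leaves such a line intact is
something you would have to check on your own data.

```python
from pathlib import Path

# Hypothetical boundary marker, written on its own line before each text.
MARKER = "<<<FILE: {name}>>>"


def merge_texts(paths, out_path, encoding="gb18030"):
    """Concatenate text files into one file, one marker line per source file.

    Returns the number of files merged.
    """
    count = 0
    with open(out_path, "w", encoding=encoding) as out:
        for p in paths:
            p = Path(p)
            out.write(MARKER.format(name=p.name) + "\n")
            text = p.read_text(encoding=encoding)
            out.write(text)
            # Make sure the next marker starts on a fresh line.
            if not text.endswith("\n"):
                out.write("\n")
            count += 1
    return count
```

Usage would be something like
merge_texts(sorted(Path("corpus").glob("*.txt")), "super.txt"), where
the folder name "corpus" is again just an example.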
Quoting patricx's post of 2005-7-26 11:04:51:
2) If WS4 could include encoding conversion tools like MLCT to unicodify data, that would be better. The operation would become much simpler.
Of course, if Mike Scott can integrate the two functions above (segmentation and encoding conversion), that would be the best news.
WST4 already has the Unicode conversion functions built in.
See the screen grab below.
(By the way, you probably have noticed, WST4 can also do
batch conversion of MS DOC to Text.)
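If you ever need to do the same conversion outside WST4, a batch
GB-to-Unicode pass is also easy to script. The sketch below is only an
illustration in Python, not how WST4 or MLCT does it. It assumes the
source files are GB-encoded plain text ending in .txt, and the function
name convert_folder and the default target encoding are my own choices.
Check which Unicode flavour your tool expects (for example "utf-16")
and pass that as dst_encoding.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, src_encoding="gb18030",
                   dst_encoding="utf-8"):
    """Re-encode every .txt file in src_dir and write the copy to dst_dir.

    Returns the list of files written. Originals are left untouched.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(src_dir.glob("*.txt")):
        # Decode from the GB encoding, then write out in the target encoding.
        text = src.read_text(encoding=src_encoding)
        dst = dst_dir / src.name
        dst.write_text(text, encoding=dst_encoding)
        written.append(dst)
    return written
```

Writing to a separate folder keeps the GB originals around in case the
guessed source encoding turns out to be wrong.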
As for segmentation of Chinese text, it's probably not going
to be part of WST, at least not any time soon, realistically speaking.
[This post was edited by the author on 2005-07-26 at 12:16:14]