[Discussion] WordSmith Tools v4.0 needs spaces added to process Chinese



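[This post was edited by the author on 2005-10-03 at 23:19:54]
Right! You can use the ACWT Chinese word segmenter to insert the spaces and then process the text. The thread at http://www.corpus4u.com/forum_view.asp?forum_id=7&view_id=420 also discusses this problem (it is about how to handle Chinese in WordSmith 3) and is worth a look. I envy you for having something as good as WordSmith 4!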
Re: [Discussion] WordSmith Tools v4.0 needs spaces added to process Chinese


Quoting lngzlz's post of 2005-10-3 23:19:34:

[This post was edited by the author on 2005-10-03 at 23:19:54]
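4. Which software can convert between encodings automatically (GB/BIG5/UTF-8/UNICODE=UTF-16)?

a) Multilingual Corpus Tool by Scott Piao, batch conversion

b) WordSmith Tools 4, GB/BIG5 -> UNICODE (UTF-16), batch conversion

c) NJStar (南极星) text converter, one file at a time

d) Chinese Annotation Tool, processes Simplified Chinese text online, one file at a time
Perl version: http://www.mandarintools.com/segmenter.html

e) MS Word/Notepad, one file at a time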

Find more at

Character encoding in corpus construction
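
For readers without any of the tools listed above, the GB/BIG5 -> UTF-16 batch conversion can also be sketched in a few lines of Python. This is only an illustration, not what WordSmith or MLCT do internally; the function name `convert_folder`, the `*.txt` filter and the default source encoding `gb18030` (a superset of GB2312/GBK) are assumptions made for the example.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, src_encoding="gb18030"):
    """Re-encode every .txt file in src_dir as UTF-16 (with a BOM) in dst_dir.

    Assumption: all files share one source encoding. Pass "big5" for
    Traditional Chinese data. The originals are left untouched.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(src.glob("*.txt")):
        # Decode with the legacy encoding; a wrong guess raises UnicodeDecodeError.
        text = path.read_text(encoding=src_encoding)
        out = dst / path.name
        # The "utf-16" codec writes a byte order mark followed by the text.
        out.write_text(text, encoding="utf-16")
        converted.append(out)
    return converted
```

Calling `convert_folder("corpus_gb", "corpus_utf16")` (both folder names hypothetical) would write converted copies into the second folder. A decoding error here usually means the source encoding was guessed wrongly.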
Dr. Xu, I fail to find "WordSmith Tools 4, GB/BIG5 -> UNICODE (UTF-16) 成批转换". Could you help me locate it in WordSmith Tools 4?
In WordSmith 4, go to "Utilities - Text converter" in the main menu.

Check Text conversion Activated.

Select the folder to be converted and make other adjustments as desired (keep the original data and create an extra copy in the temp directory?).

In the Conversion type, select "into Unicode based on", and select "Chinese (People's Republic of)".

Click on the OK button at the bottom.
XiaoZ, thanks a lot for your timely help. But Concord still does not work properly with a Chinese Unicode-encoded file. It seems that the Chinese text must be segmented with blanks before one can use the functions of Concord, Keyword and Wordlist.
The best thing to do with Chinese text is, of course, to tokenise the data. Yet you still need to convert the data into Unicode. Running text without segmentation will work with Concord, but not with Wordlist or Keyword. (I only tested running text with Concord.)

The problem you encountered probably has to do with settings. You will need to select language and font properly before loading the texts (in Adjust settings).
Re: [Discussion] WordSmith Tools v4.0 needs spaces added to process Chinese

Thank you very much, Dr. XiaoZ. But look at my screen explanation:


[This post was edited by the author on 2005-10-05 at 23:42:12]
The best practice is to avoid using Chinese characters in filenames.
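Following everyone's posts, I ran an experiment with WordSmith 3. I found that Concord has no problem with Chinese text that has been segmented, but the other functions do not work. I first segmented the text with ICTCLAS and then used Notepad to save the Chinese text as a "Unicode document". My operating system is Windows Me, so I do not know whether the converted Unicode is UTF-8 or UTF-16, or neither. When I ran the Wordlist function of WordSmith 3.0 on the Chinese text that had been converted to Unicode and segmented, it did not work.
The discussions in this thread relate to WordSmith 4, not version 3.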

With version 3, as long as your Chinese data are tokenised, there is no need to convert to Unicode. By the way, if you use "Save As" and select "Unicode document" in Notepad, the result is Unicode (UTF-16). But you do need a Simplified Chinese version of Windows or a language pack. Even then, only Concord will work, not Wordlist or Keyword.

In Wordsmith 4, you must convert Chinese data into Unicode (you can use the Text Converter in Utilities of WS4). In this case, even texts not tokenised will work with Concord. But the data must be tokenised if you want to make a wordlist or extract keywords.
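
As a rough illustration of what "adding spaces" means, here is a minimal Python sketch that puts a space around every Chinese character. Note the assumptions: this is character-level splitting, not real word segmentation as performed by ICTCLAS or ACWT, so a wordlist built from its output would be a list of characters rather than words. The function name `space_out_characters` is made up for the example, and only the basic CJK Unified Ideographs block is covered.

```python
import re

# Assumption: only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF)
# is treated as "Chinese characters"; extension blocks are ignored.
_CJK = re.compile(r"([\u4e00-\u9fff])")


def space_out_characters(text):
    """Surround each Chinese character with spaces, then collapse extra spaces."""
    spaced = _CJK.sub(r" \1 ", text)
    # Collapse runs of spaces/tabs created by adjacent characters; keep newlines.
    return re.sub(r"[ \t]+", " ", spaced).strip()


print(space_out_characters("我爱WordSmith工具"))  # prints: 我 爱 WordSmith 工 具
```

A proper segmenter remains the better choice whenever word-level wordlists or keywords are wanted.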
Dear Dr. Xiao, I changed the file name into English. It still pops up a window which says "no concordance entries found". Could you upload several screenshots so that I can follow your operations? Thanks a lot in advance.
Re: [Discussion] WordSmith Tools v4.0 needs spaces added to process Chinese

Quoting xiaoz's post of 2005-10-6 8:55:05:
even texts not tokenised will work with Concord.
Re: [Discussion] WordSmith Tools v4.0 needs spaces added to process Chinese

The problem you have encountered, I suspect, is most likely that your data have not been converted into Unicode properly. You have to use the Text Converter in WST4 to do the conversion (see my earlier replies). Or you can use MLCT to convert GB2312 (GBK) data into UTF-8 (not UTF-16; Mike cannot explain why the UTF-16 data converted using MLCT cannot be processed by WordSmith 4), and then click on the icon for "test Unicode" when choosing texts.
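
One quick way to check, outside WordSmith, what a converted file actually contains is to look at its byte order mark (BOM). The Python sketch below is only an aid for that check; the function name `sniff_bom` is made up for the example, and a file without a BOM (for instance GBK data, or UTF-8 saved without a signature) simply yields `None`, which by itself proves nothing about the encoding.

```python
import codecs


def sniff_bom(data):
    """Return the encoding implied by a leading byte order mark, or None."""
    # UTF-32 must be tested first: its little-endian BOM begins with the
    # same two bytes (FF FE) as the UTF-16 little-endian BOM.
    if data.startswith(codecs.BOM_UTF32_LE):
        return "utf-32-le"
    if data.startswith(codecs.BOM_UTF32_BE):
        return "utf-32-be"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None
```

For a file on disk, reading the first few bytes is enough, e.g. `sniff_bom(open("sample.txt", "rb").read(4))`, where `sample.txt` is a hypothetical file name.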

Here are a few screen dumps that show how WST4 works well with untokenised Unicode Chinese data.





[This post was edited by the author on 2005-10-06 at 10:40:28]
Thanks for Dr. Xiao's timely help. If possible, could you upload a small part of your untokenised.xml for me to test? I believe I have converted my data correctly, but perhaps not.
You have to tokenize your data; that's the problem. Why not tokenize your data first and then try again?
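A question for everyone: is it true that an unregistered copy of WordSmith 4.0 can display only 25 concordance lines? When I search with Concord, line 26 shows "past demo limit". How can this be solved? Thanks a lot in advance!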