[Discussion] WordSmith Tools v4.0 needs spaces added to process Chinese

lngzlz

In version 4, does Chinese text still seem to need spaces added before Concord will work?

[This post was edited by the author on 2005-10-03 at 23:19:54]
 
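Right! You can use the ACWT Chinese word segmenter to insert the spaces and then process the text. The same issue is discussed at http://www.corpus4u.com/forum_view.asp?forum_id=7&view_id=420 (that thread is about handling Chinese in WordSmith 3, but it is worth a look). I envy you for having something as good as WordSmith 4!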
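To illustrate what "adding spaces" means in practice, here is a minimal sketch. It is not ACWT or ICTCLAS: it uses simple forward maximum matching against a tiny made-up lexicon, and the names `segment`, `LEXICON` and `max_len` are hypothetical. A real segmenter will give much better results.

```python
def segment(text, lexicon, max_len=4):
    """Insert spaces using forward maximum matching against a lexicon."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            # Existing whitespace is dropped; tokens are re-joined with single spaces.
            i += 1
            continue
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += size
                break
    return " ".join(tokens)


# Toy lexicon, purely illustrative (an assumption, not taken from any tool).
LEXICON = {"语料库", "语言学", "研究"}

spaced = segment("语料库语言学研究", LEXICON)
print(spaced)
```

Characters not covered by the lexicon come out as single-character tokens, so the output is always space-delimited even when the lexicon is incomplete.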
 
Re: [Discussion] WordSmith Tools v4.0 needs spaces added to process Chinese

If you only use Concord, you should not need to add spaces, but Wordlist and Keyword do need them. In either case you must first convert the data to Unicode.


Quoting lngzlz (2005-10-3 23:19:34):
In version 4, does Chinese text still seem to need spaces added before Concord will work?
 
4. Which software can convert encodings automatically (GB/BIG5/UTF-8/UNICODE=UTF-16)?

a) Multilingual Corpus Tool by Scott Piao, batch conversion
http://www.lancs.ac.uk/staff/piaosl/research/download/download.htm

b) WordSmith Tools 4, GB/BIG5 -> UNICODE (UTF-16), batch conversion

c) NJStar (南极星) text converter, one file at a time
http://www.njstar.com

d) Chinese Annotation Tool, processes Simplified Chinese text online, one file at a time
http://www-rohan.sdsu.edu/~chinese/annotate.html
Perl version: http://www.mandarintools.com/segmenter.html

e) MS Word/Notepad, one file at a time

Find more at
http://www.corpus4u.org/showthread.php?t=699

Character encoding in corpus construction
http://www.corpus4u.org/showthread.php?t=416
 
Dr. Xu, I cannot find the "WordSmith Tools 4, GB/BIG5 -> UNICODE (UTF-16)" batch conversion. Could you help me locate it in WordSmith Tools 4?
 
In WordSmith 4, go to "Utilities - Text converter" in the main menu.

Check Text conversion Activated.

Select the folder to be converted and make other adjustments as desired (for example, whether to keep the original data and create an extra copy in the temp directory).

In the Conversion type, select "into Unicode based on", and select "Chinese (People's Republic of)".

Click on the OK button at the bottom.
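For anyone who wants to do the same GB/BIG5 to UTF-16 batch conversion outside WordSmith, here is a rough sketch. It is not how the Text Converter works internally. It assumes plain .txt files in a single folder, and the function name `convert_folder` and the folder names in the usage comment are hypothetical. The originals are left untouched and the converted copies go into a separate folder.

```python
from pathlib import Path


def convert_folder(src_dir, dst_dir, source_encoding="gb18030"):
    """Write a UTF-16 copy of every .txt file in src_dir into dst_dir.

    gb18030 is a superset of GB2312 and GBK; pass "big5" for Big5 data.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    converted = []
    for path in sorted(src.glob("*.txt")):
        # Decoding raises UnicodeDecodeError if the source encoding is wrong,
        # which is safer than silently writing garbage.
        text = path.read_text(encoding=source_encoding)
        target = dst / path.name
        # Python's "utf-16" codec writes a byte order mark at the start of the file.
        target.write_text(text, encoding="utf-16")
        converted.append(target)
    return converted


# Example call (folder names are placeholders):
# convert_folder("corpus_gb", "corpus_utf16")
```

A decoding error here usually means the source files are not in the encoding you assumed, which is worth checking before loading anything into Concord.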
 
XiaoZ, thanks a lot for your timely help. But Concord still does not work properly with a Unicode-encoded Chinese file. It seems that Chinese text must be segmented with spaces before Concord, Keyword and Wordlist can be used.
 
The best thing to do with Chinese text is, of course, to tokenise the data. Yet you still need to convert the data into Unicode. Running text without segmentation will work with Concord, but not with Wordlist or Keyword. (I only tested running text with Concord.)

The problem you encountered probably has to do with settings. You will need to select language and font properly before loading the texts (in Adjust settings).
 

Thank you very much, Dr. XiaoZ. But please look at my screenshot:

2005100523420041.jpg



[This post was edited by the author on 2005-10-05 at 23:42:12]
 
The best practice is to avoid using Chinese characters in filenames.
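A quick way to spot such filenames before loading a folder is sketched below. The helper names `has_non_ascii` and `non_ascii_names` are made up for illustration, and the check simply flags any character outside ASCII, not Chinese characters specifically.

```python
from pathlib import Path


def has_non_ascii(name):
    """Return True if the filename contains any non-ASCII character."""
    return not name.isascii()


def non_ascii_names(folder):
    """List the names of entries in folder that contain non-ASCII characters."""
    return [p.name for p in sorted(Path(folder).iterdir()) if has_non_ascii(p.name)]
```

Any names it returns are candidates for renaming before you choose texts.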
 
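Following everyone's posts, I ran an experiment with WordSmith 3. I found that Concord has no problem with Chinese text that has been word-segmented, but the other functions do not work. I first segmented the text with ICTCLAS and then used Notepad to save the Chinese text as a "Unicode document". My operating system is Windows Me, so I do not know whether the resulting Unicode is UTF-8, UTF-16, or neither. When I ran the Wordlist function of WordSmith 3.0 on the segmented, Unicode-converted Chinese text, it did not work.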
 
The discussions in this thread are about WordSmith 4, not version 3.

With version 3, as long as your Chinese data are tokenised, there is no need to convert to Unicode. By the way, if you use "Save As" and select "Unicode document" in Notepad, the result is Unicode (UTF-16). But you do need a Simplified Chinese version of Windows or a language pack. Even then, only Concord will work, not Wordlist or Keyword.

In Wordsmith 4, you must convert Chinese data into Unicode (you can use the Text Converter in Utilities of WS4). In this case, even texts not tokenised will work with Concord. But the data must be tokenised if you want to make a wordlist or extract keywords.
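If you are unsure what Notepad or a converter actually wrote, the first bytes of the file usually tell you. The sketch below only looks for a byte order mark (BOM); the function name `sniff_bom` is hypothetical, and a file without a BOM (for example GB2312 data, or UTF-8 saved without one) simply returns None. It does not try to distinguish UTF-32.

```python
import codecs


def sniff_bom(path):
    """Guess the Unicode encoding of a file from its byte order mark, if any."""
    with open(path, "rb") as f:
        head = f.read(3)
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    # No BOM found: could be a legacy encoding such as GBK, or BOM-less UTF-8.
    return None
```

A result of None for a file you believed was converted suggests the conversion step did not take effect and is worth repeating.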
 
Dear Dr. Xiao, I changed the file name to English. A window still pops up saying "no concordance entries found". Could you upload several screenshots so that I can follow your steps? Thanks a lot in advance.
 

Quoting xiaoz (2005-10-6 8:55:05):
even texts not tokenised will work with Concord.
 

The problem you have encountered, I suspect, is most likely that your data have not been converted into Unicode properly. You have to use the Text Converter in WST4 to do the conversion (see my earlier replies). Or you can use MLCT to convert GB2312 (GBK) data into UTF-8 (not UTF-16; Mike cannot explain why the UTF-16 data converted using MLCT cannot be processed by WordSmith 4), and then click on the icon for "test Unicode" when choosing texts.

Here are a few screen dumps that show how WST4 works well with untokenised Unicode Chinese data.

2005100610352494.jpg


2005100610355893.jpg


2005100610401766.jpg


2005100610362342.jpg



[This post was edited by the author on 2005-10-06 at 10:40:28]
 
Thanks to Dr. Xiao for the timely help. If possible, could you upload a small part of your untokenised.xml so that I can test it? I believe I did the data conversion correctly, but perhaps not.
 
You have not tokenized your data; that is the problem. Why not tokenize your data first and then try again?
 
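Could anyone advise: is it true that an unregistered copy of WordSmith 4.0 can only display 25 concordance lines? When I search with Concord, line 26 shows "past demo limit". How can this be solved? Thanks a lot in advance!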
 