WordSmith 3 can actually handle Chinese

xujiajin

Administrator
Staff member
#1
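WordSmith 3 can actually handle Chinese
I have just run a small experiment.
First, I used FreeICTCLAS (a Chinese word segmentation and part-of-speech tagging tool; all of the ICTCLAS source code, papers, and technical documentation are freely available at www.nlp.org.cn or www.ict.ac.cn/freeware/) to segment a Chinese plain-text file.

After that, the text can be searched in WordSmith 3 (and, by the same token, in other concordancers). You will notice something slightly odd along the way, but the results are exactly what we need.

Give it a try and see whether it works for you. If problems come up, we can discuss and solve them together.

So one important reason why English-based concordancers cannot handle Chinese is that there are no spaces between Chinese words. Once the text has been segmented, this problem is solved.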
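To illustrate why the spaces matter, here is a minimal keyword-in-context sketch in Python. It is not how WordSmith works internally; the function name `kwic` and the sample sentence are made up for illustration, and the only assumption is that the input has already been segmented with spaces.

```python
def kwic(text, keyword, width=3):
    """Return (left, keyword, right) tuples from whitespace-segmented text."""
    # Splitting on white space only yields words if the text was segmented first.
    tokens = text.split()
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical segmenter output: one space between words.
segmented = "我们 用 语料库 研究 语言 , 语料库 很 有用"
for left, kw, right in kwic(segmented, "语料库", width=2):
    print(f"{left} [{kw}] {right}")

# The same sentence without spaces is a single "word", so nothing matches.
print(kwic("我们用语料库研究语言", "语料库"))
```

With unsegmented input the whole sentence counts as one token, which is roughly the situation a space-based concordancer is in.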
 

xiaoz

Permanent super administrator
Staff member
#2
I noticed that Concord in WST 3 works fine with tokenized Chinese text, but Wordlist does not. If you have a non-Chinese version of Windows XP, you can set the default non-Unicode character set to Chinese (PRC) using the following procedure:

1) Log on as a member of the Administrators group
2) Go to Control Panel - Date, Time, Language and Regional Options
3) Select Regional and Language Options
4) Click on the Languages tab
5) If you have not done so yet, check the box next to "Install files for East Asian languages" and select the type of Chinese you need (simplified for PRC Chinese and traditional for Taiwanese Chinese; you may need the Windows installation CD at this point if the installation files were not copied to your local drive when Windows was installed)
6) After installing Chinese (and restarting), click on Details to install a Chinese IME
7) If you like, you can also select the Advanced tab and set the default non-Unicode system language to Chinese, so that the menus of programs written in Chinese are displayed correctly

Once Windows is configured in this way, your system works just like the Chinese version of Windows.

If you have an earlier version of Windows, you will need a language support package. With that installed, WST 3 Concord works as well.
 

xujiajin

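Administrator
Staff member
#4
As richard said, whether Chinese characters can be displayed depends on the operating system.
Whether the text can be searched the way English can depends mainly on whether there are spaces between Chinese words. Once the spaces are added, the problem goes away.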
 

xujiajin

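Administrator
Staff member
#5
Also, see hancunxin's post "[原创]中文语料库检索的福音" ("[Original] Good news for Chinese corpus searching"). The search tool supplied on the CLEC CD has also been tested successfully.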
 

xiaoz

Permanent super administrator
Staff member
#6
In the legacy encodings commonly used for CJK text (GB2312 for simplified Chinese, for example), each character is stored as two bytes. Without white space, the second byte of one character may combine with the first byte of the next to form a different character, turning all of the following text into rubbish. See my new posting on "character encoding" for more discussion.
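The byte-pairing effect can be reproduced with a small Python sketch. The assumptions here are mine: the text is encoded as GB2312, and a program starts reading one byte late. Whether WordSmith 3 fails in exactly this way has not been checked; `decode_from_second_byte` is just an illustrative helper.

```python
def decode_from_second_byte(text, encoding="gb2312"):
    """Encode text, drop the first byte, and decode what is left."""
    raw = text.encode(encoding)
    # Undecodable bytes become U+FFFD instead of raising an exception.
    return raw[1:].decode(encoding, errors="replace")


unsegmented = "中文语料"
segmented = "中文 语料"

# Two bytes per Chinese character in GB2312: 4 characters -> 8 bytes.
print(len(unsegmented), len(unsegmented.encode("gb2312")))

# Without spaces the wrong byte pairs run on to the end of the string.
garbled = decode_from_second_byte(unsegmented)
print(garbled)

# An ASCII space is a single byte below 0x80, so the decoder can
# realign there and the word after the space is read correctly.
resynced = decode_from_second_byte(segmented)
print(resynced)
```

In the segmented version only the word before the space is damaged; everything after it decodes normally.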

In segmented text the word boundaries are clearly marked, and concordancers use them to extract words. You can also use a wildcard character such as * to search for words or phrases that contain, start with, or end with a certain character or character string.
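As a rough sketch of what a wildcard search over segmented text amounts to, the Python below matches each token against a `*` pattern using the standard `fnmatch` module. This is not the concordancer's own pattern syntax; `wildcard_search` and the sample tokens are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_search(segmented_text, pattern):
    """Return the tokens of whitespace-segmented text that match pattern."""
    return [tok for tok in segmented_text.split() if fnmatchcase(tok, pattern)]


# Hypothetical segmented sample.
segmented = "语料库 语言学 研究 语言 和 语料"

print(wildcard_search(segmented, "语*"))   # words starting with 语
print(wildcard_search(segmented, "*学"))   # words ending with 学
print(wildcard_search(segmented, "*言*"))  # words containing 言
```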
 

xujiajin

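Administrator
Staff member
#8
If you cannot save, try pressing Ctrl + A (select all) in the results window, then Ctrl + C (copy), then open MS Word and press Ctrl + V (paste), and see whether that works.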
 

xujiajin

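Administrator
Staff member
#11
Re: WordSmith 3 can actually handle Chinese

Quoting xujiajin, 2005-7-6 12:59:54:
If you cannot save, try pressing Ctrl + A (select all) in the results window, then Ctrl + C (copy), then open MS Word and press Ctrl + V (paste), and see whether that works.
hancunxin has tried it and confirmed that it works. These shortcuts should work for any text editing under Windows, so give it a try.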
 

xujiajin

Administrator
Staff member
#15
A word list should be generated from segmented text. I succeeded with Wconcord, and there should be no problem with WordSmith either.
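For anyone who wants to check a word list independently of any particular tool, a frequency list can be built from segmented text in a few lines of Python. This is only a sketch under the assumption that words are separated by spaces; `word_list` and the sample text are made up.

```python
from collections import Counter


def word_list(segmented_text):
    """Return (word, frequency) pairs, most frequent first."""
    return Counter(segmented_text.split()).most_common()


# Hypothetical segmented sample.
segmented = "语料库 研究 语料库 语言 研究 语料库"
for word, freq in word_list(segmented):
    print(word, freq)
```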
 

appler

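Junior member
#16
"As richard said, whether Chinese characters can be displayed depends on the operating system.
Whether the text can be searched the way English can depends mainly on whether there are spaces between Chinese words. Once the spaces are added, the problem goes away."
I have written a small program in C#.
If anyone wants it, leave your e-mail address.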
 

appler

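Junior member
#19
I wrote this little program a few days ago for a friend who was also struggling with the lack of spaces between Chinese words.
It was written to solve exactly that problem: putting spaces between Chinese words.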
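For readers curious what such a tool might do, here is a toy Python sketch of one simple approach, forward maximum matching against a word list. It is not the C# program mentioned above and not ICTCLAS; the dictionary, the `segment` function, and the `max_len` parameter are all invented for illustration, and real segmenters are considerably more sophisticated.

```python
def segment(text, dictionary, max_len=4):
    """Insert spaces between words using forward maximum matching."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return " ".join(words)


# Hypothetical toy dictionary; a real one would hold many thousands of words.
toy_dictionary = {"我们", "语料库", "研究", "语言"}
print(segment("我们用语料库研究语言", toy_dictionary))
```

Characters that are not covered by the dictionary come out as single-character words, so the quality of the result depends heavily on the word list.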
 

xiaoz

Permanent super administrator
Staff member
#20
Re: WordSmith 3 can actually handle Chinese

I haven't tried Concordance, but if it is Unicode-based, as 动态语法 suggested, there should be no problem with this tool.

With WordSmith 3, however, only Concord works on segmented Chinese texts. Wordlist, and by extension Cluster and Keyword, do not work.

 