ICTCLAS cannot process large texts over 1 MB.

patricx

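Senior Member
ICTCLAS cannot process large texts, and it has no batch-processing function either. Faced with a large number of texts, hundreds of them, how should I go about segmenting them? Are there any other word segmentation tools? Could the experts here please advise!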
 
I have just processed a Chinese file of over 5 MB using ICTCLAS.
The problem is not with file size, but with some peculiarities in the Chinese texts. My experience tells me that ICTCLAS crashes in three situations: when the Chinese texts contain very long strings of English letters (as in many English words with no white space between them, the spaces having been removed by some Chinese processing tools); when the Chinese texts contain special characters, some of which are invisible to human eyes although the machine knows they are there; and when the Chinese texts contain some very long paragraphs. This is a headache with freeware.
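As a rough illustration of how one might look for those three triggers before running the segmenter, here is a minimal Python sketch. The function name `find_suspects` and both thresholds (40 letters, 5000 characters) are my own assumptions, not values known to matter to ICTCLAS, so adjust them to your data.

```python
import re
import unicodedata

# Both thresholds are guesses for illustration, not documented ICTCLAS limits.
LONG_LATIN_RUN = re.compile(r"[A-Za-z]{40,}")
MAX_PARAGRAPH_CHARS = 5000


def find_suspects(text):
    """Return (line_number, reason) pairs for lines worth inspecting by hand."""
    suspects = []
    # Split on "\n" only, so that unusual line-break characters stay inside
    # the line and get reported instead of being silently consumed.
    for number, line in enumerate(text.split("\n"), start=1):
        if LONG_LATIN_RUN.search(line):
            suspects.append((number, "long run of English letters"))
        # Unicode categories Cc (control) and Cf (format) are invisible;
        # tab and carriage return are tolerated here.
        if any(unicodedata.category(ch) in ("Cc", "Cf") and ch not in "\t\r"
               for ch in line):
            suspects.append((number, "invisible control or format character"))
        if len(line) > MAX_PARAGRAPH_CHARS:
            suspects.append((number, "very long paragraph"))
    return suspects
```

The text is assumed to be already decoded into a Python string (for example with `open(path, encoding="gb2312")`), and each line is treated as one paragraph.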
 
Thanks, Dr. Xiao. But I don't know what these special characters are, or how to get them out of the text. Could you give me some specific directions? Thank you very much.
 
Some of them are control characters used in file transmission, typically found in newswire texts. They are not visible. One such character, which cost me hours today, is a non-DOS newline character. When the text is opened in Notepad it appears as one line, but when it is opened in a concordancer there are two lines. As that character cannot be copied and pasted into a search/replace box, and there is no way to enter it on the keyboard, it cannot easily be removed by searching and replacing it with nothing. Once I found out what it really was, I saved the file as Unicode, and this time it was easy to remove such blank lines. The data was then converted back to GB2312 for processing with ICTCLAS. There are many other characters of different natures.
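For anyone who would rather script this clean-up than do it by hand, a minimal Python sketch follows. It is not the procedure described above, only one possible equivalent: it decodes the GB2312 file, turns every line-break variant into an ordinary newline, deletes the remaining control characters, and writes the result back as GB2312 with DOS line endings. The names `clean_text` and `clean_file` and the exact set of characters treated as line breaks are my assumptions.

```python
import re

# Line-break variants other than a plain "\n": CR LF, bare CR, vertical tab,
# form feed, NEL, and the Unicode line and paragraph separators.
_LINE_BREAKS = re.compile(r"\r\n|[\r\x0b\x0c\x85\u2028\u2029]")
# Remaining C0 control characters (tab and newline are kept), plus DEL.
_CONTROLS = re.compile(r"[\x00-\x08\x0e-\x1f\x7f]")


def clean_text(text):
    """Normalise line breaks to "\\n" and delete other control characters."""
    text = _LINE_BREAKS.sub("\n", text)
    return _CONTROLS.sub("", text)


def clean_file(src, dst, encoding="gb2312"):
    """Read src, clean it, and write dst in the same encoding with CR LF."""
    # newline="" stops Python from translating line endings while reading.
    with open(src, "r", encoding=encoding, errors="replace", newline="") as f:
        text = f.read()
    # newline="\r\n" writes DOS line endings; errors="replace" turns any
    # character that GB2312 cannot encode into "?".
    with open(dst, "w", encoding=encoding, errors="replace", newline="\r\n") as f:
        f.write(clean_text(text))
```

Note that `errors="replace"` hides undecodable bytes rather than reporting them. If you want to locate them instead, use `errors="strict"` and read the position given in the exception.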

I would suggest that you split the file that causes the problem into halves and test each half with ICTCLAS. Ignore the half that can be processed, split the problematic half again and test, and repeat until you find the root of the problem. Then fix that problem across the whole corpus; data from the same source normally shares the same one or two problems.
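The halving procedure can also be automated. The sketch below is one way to do it in Python, under two assumptions: the crash is reproducible, and a single line is enough to trigger it. `can_process` is a hypothetical callback that you have to supply yourself, for example by writing the chunk to a temporary file, running the segmenter on it in a separate process, and returning whether it finished normally. No ICTCLAS command line or API is assumed here.

```python
def find_failing_range(lines, can_process):
    """Narrow a failing list of lines down by repeated halving.

    Returns None if the whole input is processed without trouble,
    otherwise a (low, high) pair such that lines[low:high] still fails.
    """
    if can_process(lines):
        return None
    low, high = 0, len(lines)  # the failure lies somewhere in lines[low:high]
    while high - low > 1:
        mid = (low + high) // 2
        if not can_process(lines[low:mid]):
            high = mid  # first half fails: keep searching there
        elif not can_process(lines[mid:high]):
            low = mid   # second half fails: keep searching there
        else:
            # Both halves pass on their own, so the failure depends on the
            # two together (total size, for instance). Stop narrowing.
            break
    return low, high
```

If the returned range is a single line, inspect that line for the offending character or string. If it is wider, the problem is probably not one bad line but something about the chunk as a whole.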
 
We would be very glad to have such a tool to preprocess the raw data and get rid of all the messy codes and characters.
 
What takes time in text processing is not programming, but treating irregularities at the pre-processing stage.
 