Hanzi (Chinese character) segmenter


(Last modified: 2000-10-22)

The following is a tiny Perl script that generates a list of hanzi (i.e., Chinese characters) from GB-encoded running text. Note that only hanzi are output (i.e., codes from B0A1 upward in the GB code space). The script is presented here for illustration purposes only.
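To make the "code space beginning from B0A1" remark concrete, here is a minimal sketch of a range test on a single two-byte code. The helper name `is_gb_hanzi` is hypothetical, and the upper bounds (lead byte up to F7, trail byte up to FE) are an assumption taken from the GB2312 layout; the text above only states the lower bound.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $pair is a two-byte code in the hanzi block.
# Assumption: lead byte 0xB0-0xF7, trail byte 0xA1-0xFE (GB2312 rows 16-87).
sub is_gb_hanzi {
    my ($pair) = @_;
    return 0 unless defined $pair && length($pair) == 2;
    my ($lead, $trail) = unpack('C2', $pair);   # two unsigned byte values
    return ( $lead  >= 0xB0 && $lead  <= 0xF7
          && $trail >= 0xA1 && $trail <= 0xFE ) ? 1 : 0;
}

# B0A1 is the first hanzi; A1A1 is a GB symbol (ideographic space), not a hanzi.
for my $code ("\xB0\xA1", "\xA1\xA1") {
    printf "%s: %s\n", uc(unpack('H*', $code)),
        is_gb_hanzi($code) ? "hanzi" : "not hanzi";
}
```

Unlike a plain string comparison against B0A1, this test also rejects single stray bytes and codes above the hanzi block.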


#!/usr/local/bin/perl

# By Da Jun


# For comments and suggestions, please contact me at:


# jun@lingua.mtsu.edu


# Last modified: Dec. 8, 1999


# Use at your own risk :). Freely distributable as long as this notice
# is kept intact.


# This script segments a plain GB-encoded text file (which may contain
# other ASCII codes) into a list of characters with one character per line.
# All other codes are discarded. Output is written to STDOUT, with each line
# containing one character followed by \n (newline).


# To run the script on a Unix system, do the following:

# 1) Save it as a text file (e.g., name it 'seggb');
# 2) Find out where the Perl interpreter is on your system. It is
# usually in the /usr/local/bin folder (which is the default used
# here) or /usr/bin (on some Unix systems). The shell command "whereis perl"
# will tell you where the Perl interpreter is on your system.
# 3) Make the script executable by issuing the following command at the prompt:

# chmod u+x seggb

# Now you are ready to run the script.


# At the prompt, issue the following command (assuming you saved the script as
# 'seggb'):

# seggb myGBtextfile

# in which 'myGBtextfile' is the name of any GB text file you want to segment.
# Note that several files can be processed at the same time, e.g.,

# seggb file1 file2 file3 ...

# The script can also take input from a pipe. Suppose we have a text file
# 'fhy.txt'. We can also use the script in the following (dummy) way:

# cat fhy.txt | seggb


while ( $line = <> ) {
    # First, we pre-process the input line to get rid of spaces and a few
    # known control characters (newline, carriage return, form feed, tab)
    # that may be hidden in the text file.
    $line =~ s/[ \n\r\f\t]//g;
    # Second, we want to make sure that the line is not empty (otherwise
    # there is nothing to process). Note that we use the line length as the
    # test. We could instead test the string itself with "if ($line ne '')",
    # but in practice the length test seemed to work better on texts that
    # contain a mixture of two-byte and one-byte codes. I don't know why
    # that is the case, but this is what I found.
    # If the line is not empty,
    if ( length($line) > 0 ) {
        # we do the following:
        while ( length($line) > 0 ) {
            # 1) Get rid of any ASCII code(s) that may be at the beginning of the line.
            $line =~ s/^[\x00-\x7F]+//;
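            # 2) Take the first two bytes of $line:
            $mychar = substr($line, 0, 2);
            # 3) If the two bytes stored in $mychar are a GB-encoded hanzi, we send
            # them to STDOUT. Note that the string in the quotes is the two-byte
            # GB code B0A1, written as hex escapes so that it stays correct
            # whatever encoding the script file itself is saved in.
            if ( length($mychar) == 2 && $mychar ge "\xB0\xA1" ) { print "$mychar\n"; }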
            # 4) Get rid of the first two bytes (or the single remaining byte, so
            # that the loop always terminates) and process the next two bytes.
            $line =~ s/^.{1,2}//s;
        } # End of the inner while starting from 1)
    } # End of the if at the top which tests that the line is not empty.
} # End of the top while loop
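
The four steps in the loop above can also be written as a single regular-expression scan. The following is only a sketch of an alternative, not the script's own method: the function name `segment_gb` is hypothetical, and the byte ranges (hanzi lead byte B0 to F7, any GB two-byte code A1 to FE in both bytes) are assumptions taken from the GB2312 layout. Non-hanzi two-byte codes are consumed as a unit so that the scan stays aligned on character boundaries.

```perl
use strict;
use warnings;

# Hypothetical alternative to the substr loop: return the list of hanzi
# (as two-byte strings) found in one line of GB-encoded text.
sub segment_gb {
    my ($line) = @_;
    my @hanzi;
    while ( $line =~ /\G(?:([\xB0-\xF7][\xA1-\xFE])   # a hanzi: keep it
                        |[\xA1-\xFE][\xA1-\xFE]       # other GB pair: skip as a unit
                        |.)/gsx ) {                   # any single byte: skip
        push @hanzi, $1 if defined $1;
    }
    return @hanzi;
}

# Sample line: ASCII "ab", hanzi B0A1, ASCII comma, GB punctuation A1A3,
# hanzi D6D0, newline. The codes are printed in hex to keep the output readable.
my @found = segment_gb("ab\xB0\xA1,\xA1\xA3\xD6\xD0\n");
print join(' ', map { uc(unpack('H*', $_)) } @found), "\n";   # prints "B0A1 D6D0"
```

To reproduce the script's behaviour, call `segment_gb` on each input line and print every returned element followed by a newline.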