http://lingua.mtsu.edu/chinese-computing/faq/segmenter.html
Hanzi (Chinese character) segmenter
(Last modified: 2000-10-22)
The following is a tiny Perl script that generates a list of hanzi (i.e., Chinese characters) from GB-encoded running text. Note that only hanzi are output (i.e., two-byte codes from B0A1 upward). The script is presented here for illustration purposes only.
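As a rough illustration of the B0A1 boundary mentioned above, the sketch below tests whether a two-byte string lies in the hanzi rows of GB 2312 (lead byte B0 to F7, trail byte A1 to FE). The function name is_gb_hanzi is hypothetical, and the upper bound F7 comes from the GB 2312 layout rather than from the script on this page, which only checks the lower bound. The test does not exclude the five unassigned positions at the end of row D7.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the original script): return 1 if $pair
# is exactly two bytes in the GB 2312 hanzi rows, 0 otherwise.
# Lead byte 0xB0-0xF7, trail byte 0xA1-0xFE.
sub is_gb_hanzi {
    my ($pair) = @_;
    return $pair =~ /\A[\xB0-\xF7][\xA1-\xFE]\z/ ? 1 : 0;
}

# B0A1 is the first hanzi; A1A3 is a GB punctuation mark.
print is_gb_hanzi("\xB0\xA1") ? "hanzi\n" : "not hanzi\n";   # prints "hanzi"
print is_gb_hanzi("\xA1\xA3") ? "hanzi\n" : "not hanzi\n";   # prints "not hanzi"
```

The strings are written with \x escapes so that the example does not depend on the encoding in which the file itself is saved.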
--------------------------------------------------------------------------------
#!/usr/local/bin/perl
# By Da Jun
#
# For comments and suggestions, please contact me at:
#
# jun@lingua.mtsu.edu
#
# Last modified: Dec. 8, 1999
#
# Use at your own risk. Freely distributable as long as this notice
# is kept intact.
#
# This script segments a plain GB-encoded text file (which may also contain
# ASCII codes) into a list of characters with one character per line.
# All other codes are discarded. Output is written to STDOUT with each line
# containing one character followed by \n (newline).
#
# To run the script on a Unix system, do the following:
#
# 1) Save it as a text file (e.g., name it 'seggb');
# 2) Find out where the Perl interpreter is on your system. It is
# usually in the /usr/local/bin directory (which is the default used
# here) or in /usr/bin (on some Unix systems). The shell command
# "whereis perl" will tell you where the Perl interpreter is. If it is
# not in /usr/local/bin, change the first line of the script accordingly.
# 3) Make the script executable by issuing the following command at the prompt:
#
# chmod u+x seggb
#
# Now you are ready to run the script.
#
# At the prompt, issue the following command (assuming you saved the script as
# 'seggb' in a directory that is on your PATH):
#
# seggb myGBtextfile
#
# in which 'myGBtextfile' is the name of any GB text file you want to segment.
# Note that several files can be processed in one run, e.g.,
#
# seggb file1 file2 file3 ...
#
# The script can also take input from a pipe. Suppose we have a text file
# 'fhy.txt'. We can also use the script in the following (trivial) way:
#
# cat fhy.txt | seggb
#
# END OF NOTES
while ( $line = <> ) {
    # First, we pre-process the input line to get rid of spaces and a few
    # known control characters that may be hidden in the text file.
    $line =~ s/[ \n\r\f\t]//g;

    # Second, we want to make sure that the line is not empty (otherwise
    # there'll be nothing to process). Note that we test the line length
    # numerically. Testing the string itself would treat a line consisting
    # of "0" as false, and comparing the length with "" is always true.
    # If the line is not empty,
    if ( length($line) > 0 ) {
        # we do the following:
        while ( length($line) > 0 ) {
            # 1) Get rid of any ASCII code(s) that may be at the beginning
            #    of the line.
            $line =~ s/^[\x00-\x7F]+//;

            # 2) Take the first two bytes of $line:
            $mychar = substr($line, 0, 2);

            # 3) If the two bytes stored in $mychar are a GB-encoded hanzi,
            #    we send them out to STDOUT. Note that the string in the
            #    quotes is the two-byte code B0A1, the first hanzi.
            if ( length($mychar) == 2 && $mychar ge "\xB0\xA1" ) { print "$mychar\n"; }

            # 4) Get rid of the first two bytes (or a lone trailing byte, so
            #    that the loop always terminates) and process the next two.
            $line =~ s/^..?//s;
        } # End of the inner while starting from 1)
    } # End of the if at the top which tests that the line is not empty.
} # End of the top while loop
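The byte-by-byte loop above can also be sketched as a single regular-expression scan. The version below is only an illustration under stated assumptions, not the author's method: the function name segment_gb is hypothetical, the input is assumed to be raw (undecoded) GB bytes, and the hanzi range is limited to lead bytes B0 to F7, following the GB 2312 layout. Non-hanzi two-byte codes are consumed as a pair so that the scan stays aligned on character boundaries.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the list of two-byte GB hanzi found in $bytes.
# At each position the pattern tries, in order:
#   1) a hanzi pair (captured),
#   2) any other two-byte GB pair (skipped as a unit),
#   3) any single byte, such as ASCII or a stray byte (skipped).
sub segment_gb {
    my ($bytes) = @_;
    my @hanzi;
    while ( $bytes =~ /\G(?:([\xB0-\xF7][\xA1-\xFE])|[\xA1-\xFE][\xA1-\xFE]|.)/gs ) {
        push @hanzi, $1 if defined $1;
    }
    return @hanzi;
}

# Sample input: ASCII text, the hanzi B0A1, the punctuation mark A1A3,
# and the two hanzi D6D0 and CEC4.
my $sample = "GB: \xB0\xA1\xA1\xA3 ok \xD6\xD0\xCE\xC4\n";

# Print the result as hex so the output does not depend on the terminal
# encoding.
print join( " ", map { unpack( "H*", $_ ) } segment_gb($sample) ), "\n";
# prints "b0a1 d6d0 cec4"
```

To process files the way the original script does, one could call segment_gb on each line read with <> and print every returned element followed by a newline.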