Hanzi (Chinese character) segmenter

Posted by xujiajin on 2005-10-21 in the "Programming and Tool Development" forum

  1. xujiajin

    xujiajin, Administrator (Staff Member)


    Hanzi (Chinese character) segmenter
    (Last modified: 2000-10-22)

    The following is a tiny Perl script that generates a list of hanzi (i.e., Chinese characters) from GB-encoded running text. Note that only hanzi are output (i.e., two-byte codes from B0A1 upward). The script is presented here for illustration purposes only.
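    Why B0A1? In GB2312, each two-byte code maps to a row and a cell of a 94 x 94 table (subtract hex A0 from each byte), and the hanzi start at row 16, i.e. at lead byte B0. The short Perl sketch below illustrates that mapping. It is not part of Da Jun's script, and the helper name gb_row_cell is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): converts a two-byte
# GB2312 code into its (row, cell) position in the 94 x 94 code table.
sub gb_row_cell {
    my ($code) = @_;
    my ($hi, $lo) = unpack("C2", $code);   # the two bytes as integers
    return ($hi - 0xA0, $lo - 0xA0);
}

# A3C1 is the full-width letter A, B0A1 is the first hanzi, D6D0 is another hanzi.
for my $code ("\xA3\xC1", "\xB0\xA1", "\xD6\xD0") {
    my ($row, $cell) = gb_row_cell($code);
    printf "%s: row %d, cell %d, %s\n",
        uc(unpack("H*", $code)), $row, $cell,
        $row >= 16 ? "hanzi" : "not hanzi";
}
```

    Rows below 16 hold punctuation, full-width letters and other symbols, which is why a simple byte-string comparison against B0A1 is enough to filter them out.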


    #!/usr/local/bin/perl

    # By Da Jun
    # For comments and suggestions, please contact me at:
    # jun@lingua.mtsu.edu
    # Last modified: Dec. 8, 1999

    # Use at your own risk :). Freely distributable as long as this notice
    # is kept intact.

    # This script segments a plain GB-encoded text file (which may also contain
    # ASCII codes) into a list of characters, one character per line.
    # All other codes are discarded. Output is written to STDOUT, with each line
    # containing one character followed by \n (newline).

    # To run the script on a Unix system, do the following:

    # 1) Save it as a text file (e.g., name it 'seggb').
    # 2) Find out where the Perl interpreter is on your system. It is
    # usually in the /usr/local/bin folder (which is the default used
    # here) or /usr/bin (on some Unix systems). The shell command "whereis perl"
    # will tell you where the Perl interpreter is on your system.
    # 3) Make the script executable by issuing the following command at the prompt:

    # chmod u+x seggb

    # Now you are ready to run the script.

    # At the prompt, issue the following command (assuming you saved the script as
    # 'seggb'):

    # seggb myGBtextfile

    # in which 'myGBtextfile' is the name of any GB text file you want to segment.
    # Note that several files can be processed at the same time, e.g.,

    # seggb file1 file2 file3 ...

    # The script can also take its input from a pipe. Suppose we have a text file
    # 'fhy.txt'. We can also use the script in the following (dummy) way:

    # cat fhy.txt | seggb

    while ( $line = <> ) {
        # First, we pre-process the input line to get rid of whitespace and a few
        # known control characters that may be hidden in the text file.
        $line =~ s/[ \n\r\f\t]//g;

        # Second, we make sure that the line is not empty (otherwise there is
        # nothing to process). We test the line length rather than the truth
        # value of the string, because Perl treats the string "0" as false even
        # though it is not empty.
        # If the line is not empty,
        if ( length($line) > 0 ) {
            # we do the following:
            while ( length($line) > 0 ) {
                # 1) Get rid of any ASCII code(s) at the beginning of the line.
                $line =~ s/^[\x00-\x7F]+//;

                # 2) Take the first two bytes of $line:
                $mychar = substr($line, 0, 2);

                # 3) If the two bytes stored in $mychar are a GB-encoded hanzi, we
                # send them to STDOUT. The string in the quotes is the two-byte
                # code B0A1, written as hex escapes so that the test does not
                # depend on the encoding in which this script is saved.
                if ( length($mychar) == 2 && $mychar ge "\xB0\xA1" ) { print "$mychar\n"; }

                # 4) Get rid of the first two bytes and process the next two bytes
                # in the line. The /s modifier lets "." match any byte, and {1,2}
                # also removes a lone trailing byte, so the loop always terminates.
                $line =~ s/^.{1,2}//s;
            } # End of the inner while starting from 1)
        } # End of the if at the top which tests that the line is not empty.
    } # End of the top while loop
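
    The same idea can be packaged as a subroutine, which makes it easier to try on small inputs. The following is only a sketch and not part of Da Jun's script: the name segment_gb is made up for illustration, and it assumes the input is a raw GB2312 byte string in which both bytes of a two-byte code lie in the range A1 to FE. Any byte that does not start such a pair is skipped one byte at a time.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): returns the list of
# two-byte GB codes at or above B0A1 found in a raw GB-encoded byte string.
sub segment_gb {
    my ($bytes) = @_;
    my @hanzi;
    # Walk the string from left to right. At each position, consume either a
    # two-byte GB code (both bytes in A1..FE) or, failing that, a single byte.
    while ( $bytes =~ /\G(?:([\xA1-\xFE][\xA1-\xFE])|[\x00-\xFF])/g ) {
        my $pair = $1;
        next unless defined $pair;              # single byte: discard
        push @hanzi, $pair if $pair ge "\xB0\xA1";
    }
    return @hanzi;
}

# Input: ASCII "A", the full-width letter A (A3C1), then two hanzi
# (D6D0 and CEC4). The hex codes of the hanzi found are printed.
my @found = segment_gb("A\xA3\xC1\xD6\xD0\xCE\xC4");
print join(" ", map { uc(unpack("H*", $_)) } @found), "\n";
```

    Printing the hex codes rather than the raw bytes keeps the example readable on a terminal that is not set to a GB encoding.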