Sentence Segmenter

xujiajin

Administrator
Staff member
http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/

**Name
sentseg.pl - Sentence Segmenter

**Synopsis: Usage
$ ./sentseg.pl < InputFile > OutputFile

**Description
This Perl script reads a text file from standard input and splits it so that each sentence is on a separate line. The script does not, however, guarantee 100 percent accuracy, for the reasons described in the Notes below.

**Notes
Even though the script works fine for most purposes, 100 percent accuracy is not guaranteed. The script determines sentence boundaries from orthographic features alone (punctuation, spacing, and capitalization) and does not take context into consideration. For this reason it is indispensable to scan the output file manually after running the script to see whether any irregularities have occurred.
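
To illustrate what a purely orthographic rule means, here is a minimal sketch. It is not the script itself: naive_split is a hypothetical helper that applies a single rule (a full stop, a space, then a capital letter ends a sentence), and the sample sentences are made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# naive_split is a hypothetical helper, not part of sentseg.pl.
# It applies one orthographic rule only: ". " followed by a capital
# letter is treated as a sentence boundary.
sub naive_split {
    my ($text) = @_;
    $text =~ s/\. ([A-Z])/.\n$1/g;   # Break the line at ". Capital"
    return split /\n/, $text;        # One list element per sentence
}

my @ok  = naive_split("It rained. We stayed home.");
my @bad = naive_split("We met Lt. Smith at noon.");

print scalar(@ok),  " sentences: ", join(" | ", @ok),  "\n";
print scalar(@bad), " sentences: ", join(" | ", @bad), "\n";
```

The first text is split correctly into two sentences. The second is a single sentence, but the rule cuts it after "Lt." because it only sees a full stop followed by a capital letter. This is the kind of irregularity a manual scan of the output should catch.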

Most errors involve abbreviations with a full stop. The script handles popular abbreviations like Mr., Ms., Dr., and D.C. correctly. It is, however, unrealistic to cover all possibilities. If you are going to repeat the work on a certain genre of text, you can improve accuracy by modifying the list of abbreviations defined in the script. To modify the list to suit your purpose, enter new abbreviations in lines 20 and 22.
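
As a rough sketch of how an abbreviation list of this kind can work, the example below splits at every full stop followed by a capital letter and then undoes the split after listed abbreviations. The names $abbr and segment are hypothetical, the list is shortened for illustration, and "Lt" and "Fig" stand in for genre-specific additions.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Shortened, illustrative list written as a regex alternation.
# "Lt" and "Fig" are example additions, not entries from the script.
my $abbr = "Mr|Dr|Lt|Fig";

# segment is a hypothetical helper: split at ". Capital", then rejoin
# wherever the full stop belongs to a listed abbreviation.
sub segment {
    my ($text) = @_;
    $text =~ s/\. ([A-Z])/.\n$1/g;     # Tentative boundary at ". Capital"
    $text =~ s/\b($abbr)\.\n/$1. /g;   # Undo the boundary after an abbreviation
    return split /\n/, $text;
}

print join("\n", segment("See Fig. A for details. Lt. Smith agreed.")), "\n";
```

With "Fig" and "Lt" in the list, the text comes out as two sentences. Removing either entry from $abbr brings back the wrong split after that abbreviation.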
 
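Re: Sentence Segmenter

Could some expert rewrite it so that it works on all the files in a subdirectory? That is, I double-click a file, and every file in the subdirectory containing that file gets segmented into sentences.

Thanks in advance.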
 
Re: Sentence Segmenter

Two files are needed.
To do the job, double-click batch.bat.
The tokens in bold were added by me to meet your need.
Have fun!

##########################
## batch.bat
##########################
perl test.pl

##########################
## test.pl
##########################
#! /usr/bin/perl
#
# Name: sentseg.pl - Sentence Segmenter
#
# Usage ./sentseg.pl < InputFile > OutputFile
#
# Variables
# @line Array of lines read from text
# $text The whole text as one variable
# $abbr1 Abbreviations that do not occur at the end of a sentence
# $abbr2 Abbreviations that can occur at the end of a sentence
# @sentence Data in sentence form stored in an array
# $i Counter
# $out Final output
# Initializing Variables
$abbr1="M([rs]|rs|me)|Dr|U\\\.S(\\\.A|)|[aApP]\\\.[mM]|Calif|Fla|D\\\.C|N\\\.Y|Ont|Pa|V[Aa]|[MS][Tt]|Jan|Feb|Mar|Apr|Aug|Sept?|Oct|Nov|Dec|Assoc|[oO]\\\.[kK]|Co|R\\\.V|Gov|Se[nc]|U\\\.N|\[A-Z\]|i\\\.e|e\\\.g|vs?|Re[pv]|Gen|Univ|Jr|[fF]t|[Ss]gt|[Pp]res|[Pp]rof|[Aa]pprox|[Cc]orp|[Dd]ef";
$abbr2="D\\\.C";

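# Added: segment every file in the current directory
system("dir * >temptemp.temp"); # Save the Windows directory listing to a temporary file
open(F,"temptemp.temp") or die "can't open temptemp.temp\n";
$count=1;
while(<F>){
    chomp;
    if(/^ /){next;} # Skip header and summary lines, which start with a space
    if(/ <DIR> /){next;} # Skip directories
    if(length($_)==0){next;} # Skip empty lines
    @array=split /\s+/;
    $filename=$array[$#array]; # The file name is the last field of the line
    if($filename=~/\.pl$/){next;} # Skip Perl scripts
    if($filename=~/\.bak$/){next;} # Skip backup files
    if($filename=~/\.out$/){next;} # Skip output of earlier runs
    if($filename=~/\.bat$/){next;} # Skip batch files
    if($filename=~/temptemp\.temp/){next;} # Skip the temporary listing itself
    print "[".$count."] ".$filename." -> ";
    $outputfilename=$filename.".out";
    print "$outputfilename\n";
    iterative($filename,$outputfilename);
    print "......done\n";
    $count++;
}
close F;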


# Main Script

sub iterative{
    ($filename,$outputfilename)=@_; #added
    @sentence=(); #added
    $text=""; #added
    open(FO,">$outputfilename") or die "can't create $outputfilename\n";
    open(FI,$filename) or die "can't open $filename\n"; #added
    $i = 0;

    while (<FI>) { # Read one line from text #modified
        chomp;
        if ( /^[ \t]*$/ ) { # Skip if the line is empty
            next;
        } else {
            $line = $_; # Store each line in $line
            $line =~ s/^[ \t]+//; # Remove white spaces at the beginning of the line
            $line =~ s/ +/ /g; # Collapse neighboring spaces into one
            $line =~ s/\t//g; # Remove tabs
            $line =~ s/ +$//; # Remove spaces at the end of line
            $text .= " ".$line; # Put together all lines
        }
    }
    $text = substr($text,1); # Remove the space at initial position
    $text =~ s/\? /\?\n/g; # New line at ? space
    $text =~ s/! /!\n/g; # New line at ! space
    $text =~ s/\.\" /\."\n/g; # New line at ." space
    $text =~ s/\?\" ([A-Z])/\?\"\n$1/g; # New line at ?" space capital
    $text =~ s/!\" ([A-Z])/!\"\n$1/g; # New line at !" space capital
    $text =~ s/([\.\?!])\) /$1\)\n/g; # New line at .) ?) or !) space
    $text =~ s/\. ([A-Z])/\.\n$1/g; # New line at . space capital
    $text =~ s/\. \"/\.\n\"/g; # New line at . space "
    $text =~ s/\" \"/\"\n\"/g; # New line between " and "
    $text =~ s/\b($abbr1)\.\n/$1\. /g; # Delete new line at $abbr1
    $text =~ s/\b($abbr2)\. ([A-Z\"])/$1\.\n$2/g; # New line at $abbr2, keeping its full stop
    @sentence = split ( /\n/, $text); # Store sentences in the array
    foreach $out ( @sentence ) {
        if ( $out =~ /\".+\"/ ) { # Skip if " appears more than once
            $i++;
            next;
        } else {
            $sentence[$i] =~ s/\"//; # Remove "
            $i++;
        }
    }
    foreach $out ( @sentence ) {
        print FO $out, "\n";
    }
    close FI; # added
    close FO;
}
# Updated 26 March 2006
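
A note on the file listing: the script above parses the output of the Windows dir command and takes the last whitespace-separated field as the file name, so it only runs on Windows and cuts off names that contain spaces. A possible alternative, sketched here with Perl's built-in opendir and readdir, is below. list_input_files is a hypothetical helper, not part of the posted script, and it only lists the files; the call to iterative would go where the print is.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# list_input_files is a hypothetical replacement for the "dir" parsing loop.
# It returns the plain files in $dir, skipping scripts, backups, batch files
# and the .out results of earlier runs.
sub list_input_files {
    my ($dir) = @_;
    opendir(my $dh, $dir) or die "can't open directory $dir: $!\n";
    my @files = grep {
        -f "$dir/$_" && !/\.(?:pl|bak|out|bat)$/
    } readdir($dh);
    closedir($dh);
    return sort @files;
}

my $count = 1;
foreach my $filename (list_input_files(".")) {
    # In the real script, iterative($filename, "$filename.out") would be called here.
    print "[$count] $filename -> $filename.out\n";
    $count++;
}
```

Because readdir returns each directory entry as a whole string, names with spaces stay intact, and the -f test filters out directories without relying on the <DIR> marker.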
 
Re: Sentence Segmenter

Wow. Laoxiong's English is really beautiful. Bravo! And his computing skills go without saying.
 
Re: Sentence Segmenter

Thank you, thank you, thank you!
 