http://www.eng.ritsumei.ac.jp/asao/resources/sentseg/
**Name
sentseg.pl - Sentence Segmenter
**Synopsis: Usage
$ ./sentseg.pl < InputFile > OutputFile
**Description
This Perl script reads a text file from standard input and splits it so that each sentence is on a separate line. It does not guarantee 100 percent accuracy, however, for the reasons described in the Notes below.
**Notes
Although the script works well for most purposes, 100 percent accuracy is not guaranteed. The script places sentence boundaries on the basis of orthographic features alone and does not take context into consideration. For this reason, it is essential to check the output file manually after running the script to see whether any irregularities have occurred.
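To illustrate why a rule based only on orthographic features can misfire, here is a minimal Python sketch. It is not the code of sentseg.pl, whose actual rules are not reproduced here. It assumes a hypothetical rule: a boundary is any ".", "!" or "?" followed by whitespace and an uppercase letter. The names BOUNDARY and naive_split are made up for this example.

```python
import re

# Hypothetical rule, for illustration only: a sentence boundary is a
# ".", "!" or "?" followed by whitespace and then an uppercase letter.
BOUNDARY = re.compile(r'(?<=[.!?])\s+(?=[A-Z])')

def naive_split(text):
    """Split text into sentences using surface (orthographic) cues only."""
    return [s for s in BOUNDARY.split(text.strip()) if s]

if __name__ == "__main__":
    # "Prof." looks exactly like the end of a sentence to this rule,
    # so the first sentence is wrongly cut in two.
    for sentence in naive_split("He met Prof. Smith. They talked."):
        print(sentence)
```

Because the rule sees only the characters around the full stop, it cannot tell an abbreviation from a real sentence end. This is the kind of irregularity to look for when checking the output by hand.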
Most errors involve abbreviations that end in a full stop. The script handles common abbreviations such as Mr., Ms., Dr., and D.C. correctly, but it is unrealistic to cover every possibility. If you are going to repeat the work on a particular genre of text, you can improve accuracy by modifying the list of abbreviations defined in the script. To adapt the list to your purpose, enter new abbreviations in lines 20 and 22 of the script.
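The effect of extending an abbreviation list can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's implementation: the set ABBREVIATIONS and the function split_sentences are hypothetical, and the way sentseg.pl stores its list in lines 20 and 22 may differ.

```python
# Hypothetical abbreviation list; extend it for your genre of text.
ABBREVIATIONS = {"Mr.", "Ms.", "Dr.", "D.C."}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Return one string per sentence, never splitting after a listed abbreviation."""
    sentences, current = [], []
    tokens = text.split()
    for i, token in enumerate(tokens):
        current.append(token)
        is_last = i + 1 == len(tokens)
        ends_like_sentence = token.endswith((".", "!", "?"))
        next_is_capitalized = not is_last and tokens[i + 1][0].isupper()
        # Split at the end of the text, or after sentence-final punctuation
        # that is not a known abbreviation and precedes a capitalized word.
        if is_last or (ends_like_sentence
                       and token not in abbreviations
                       and next_is_capitalized):
            sentences.append(" ".join(current))
            current = []
    return sentences

if __name__ == "__main__":
    text = "Dr. Lee met Prof. Smith. They talked."
    print(split_sentences(text))                              # "Prof." is unknown
    print(split_sentences(text, ABBREVIATIONS | {"Prof."}))   # "Prof." added
```

With the default list, the sketch splits after "Prof." because it is not a known abbreviation. Once "Prof." is added, the first sentence stays intact. A listed abbreviation is never treated as a sentence end in this sketch, even when it really does end a sentence, so manual checking remains necessary.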