Automatic Sentence Segmentation

This program converts a plain running text into one-sentence-per-line format by simply adding a return code after the standard sentence terminal marks. The current version (ver. 2) covers most common and frequently-used abbreviations such as Mr., Dr., Prof., a.m., p.m. as well as sentence-initial list numbers, but you still have to post-edit the results manually for other less common abbreviations. Note that when initials are followed by the "period + space" combination as in G. W. Bush the strings containing them will be divided immediately after the period mark, resulting in improper segmentation in most cases. Also note that all strings must be terminated to be considered an independent sentential unit.
http://www.someya-net.com/00-class09/sentenceDiv.html
 
Re: Automatic Sentence Segmentation

Running the description above through the segmenter gives this result:
1 This program converts a plain running text into one-sentence-per-line format by simply adding a return code after the standard sentence terminal marks.
2 The current version (ver. 2) covers most common and frequently-used abbreviations such as Mr., Dr., Prof., a.m., p.m. as well as sentence-initial list numbers, but you still have to post-edit the results manually for other less common abbreviations.
3 Note that when initials are followed by the "period + space" combination as in G.
4 W.
5 Bush the strings containing them will be divided immediately after the period mark, resulting in improper segmentation in most cases.
6 Also note that all strings must be terminated to be considered an independent sentential unit.
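
For anyone curious why the result breaks at "G." and "W.", here is a minimal Python sketch of the kind of rule the description implies: break after a terminal mark unless the token is a known abbreviation. This is not the site's actual code; the abbreviation list and the function name `naive_split` are assumptions made up for illustration.

```python
# Assumed abbreviation list, taken from the examples in the description above.
ABBREVIATIONS = {"mr.", "dr.", "prof.", "a.m.", "p.m.", "ver."}


def naive_split(text):
    """Break after a token ending in . ? or ! unless it is a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        # Drop leading brackets and quotes so that "(ver." matches "ver."
        bare = token.lstrip("(\"'").lower()
        if token[-1] in ".?!" and bare not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Unterminated trailing text is kept as its own line (a choice made for this sketch).
        sentences.append(" ".join(current))
    return sentences


for number, sentence in enumerate(
    naive_split("Note the initials in G. W. Bush here. Dr. Smith left at 9 a.m. today."), 1
):
    print(number, sentence)
```

With this rule "Dr." and "a.m." survive because they are on the list, while "G." and "W." each end a line, which is the same kind of improper split as in lines 3 to 5 of the result.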
 
Re: Automatic Sentence Segmentation

http://misshoover.si.umich.edu/~zzheng/sentence/
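This one also does the segmentation online. It would be great to have a standalone program that runs on a local machine, ideally one that does not freeze when segmenting large files. Maybe someone here with the skills has time to put one together? Thanks in advance!
If the segmented sentences are imported into Excel and then saved as a database, wouldn't that bring us one step closer to online retrieval?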
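
As a rough idea of what such a standalone tool could look like, here is a minimal Python sketch that reads the input file line by line (so a large file is never held in memory at once) and writes each sentence both to a CSV file that Excel can open and to an SQLite database. Everything here is an assumption for illustration: the simplified boundary rule, the names `iter_sentences` and `export`, and the table layout. It does not handle abbreviations or initials.

```python
import csv
import re
import sqlite3

# Simplified boundary rule (an assumption): . ? or ! followed by whitespace.
BOUNDARY = re.compile(r"(?<=[.?!])\s+")


def iter_sentences(lines):
    """Yield sentences from an iterable of lines, keeping only the unfinished sentence in memory."""
    buffer = ""
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line is treated as a paragraph break: flush what we have.
            if buffer:
                yield buffer
                buffer = ""
            continue
        buffer = f"{buffer} {line}" if buffer else line
        parts = BOUNDARY.split(buffer)
        # Every part except the last is followed by a boundary; the last may continue on the next line.
        for part in parts[:-1]:
            yield part
        buffer = parts[-1]
    if buffer:
        yield buffer


def export(in_path, csv_path, db_path):
    """Write one sentence per row to a CSV file and to a fresh SQLite table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sentences (id INTEGER PRIMARY KEY, text TEXT)")
    with open(in_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["id", "text"])
        for number, sentence in enumerate(iter_sentences(src), start=1):
            writer.writerow([number, sentence])
            conn.execute("INSERT INTO sentences (id, text) VALUES (?, ?)", (number, sentence))
    conn.commit()
    conn.close()
```

A call such as `export("corpus.txt", "sentences.csv", "sentences.db")` (hypothetical file names) would produce both outputs in one pass; the database can then be queried with ordinary SQL, for example `SELECT text FROM sentences WHERE text LIKE '%segmentation%'`.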
 
Re: Automatic Sentence Segmentation

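No program today gets sentence segmentation 100% right, but the task is not that complicated either. Asking PhDs to write this kind of software is using a sledgehammer to crack a nut; it is like sending master architects to a building site as laborers to tie rebar, mix cement, and plaster walls. What a waste. Leave this job to us manual workers. It is our kind of work, and we will surely do it better than the PhDs :p
Sentence segmentation takes no technical sophistication. We are nobodies who only know how to use the "electric fan", that is, Word, to segment sentences, and we have processed well over ten million words of text that way, so clearly there is nothing fancy or mysterious about it. But the actual procedure has to be tailored to the text; don't expect one well-written macro or program to handle every text properly. For example, ".?!" are sentence boundary markers, but what do you do when they come with quotation marks? Single quotes, double quotes, single and double quotes used together, whether or not there is a space before the quotation mark, decimal points, abbreviations, and so on and so forth: all of these have to be considered in sentence segmentation.
Names such as G.W.Bush are not something software can handle well, because the programmers cannot anticipate every case, and even if they could, different texts bring different situations. Handling them in Word is flexible and simple. Once you actually try it, you will see how wonderful Word is.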
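
To make a few of these cases concrete, here is a minimal Python sketch of one possible boundary rule, whether it ends up in a Word wildcard search or a script. It assumes that a boundary is a terminal mark, optionally followed by closing quotes or brackets, then whitespace, then an optional opening quote or bracket and a capital letter or digit. The pattern and the name `split_sentences` are assumptions for illustration. It deliberately ignores abbreviations, and it does not cover a space between the terminal mark and a closing quote.

```python
import re

# Group 1: one or more terminal marks plus any closing quotes or brackets.
# Lookahead: the next sentence must start with an optional opening quote or
# bracket followed by a capital letter or a digit.
BOUNDARY = re.compile(r"""([.?!]+["'”’)\]]*)\s+(?=["'“‘(\[]?[A-Z0-9])""")


def split_sentences(text):
    """Insert a line break at each assumed boundary and return the pieces."""
    marked = BOUNDARY.sub(r"\1\n", text)
    # Line breaks already present in the input also act as separators here.
    return [piece.strip() for piece in marked.split("\n") if piece.strip()]


sample = 'He asked, "Why?" She paid 3.5 yuan. "Stop!" he said.'
for sentence in split_sentences(sample):
    print(sentence)
```

The decimal point in "3.5" is left alone because no whitespace follows it, and `"Stop!" he said.` stays in one piece because a lowercase letter follows the closing quote. A text that starts sentences with lowercase letters, or that writes "Dr. Smith", would need different rules, which is exactly the point about tailoring the procedure to the text.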
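
A small Python sketch shows why initials are awkward for any fixed rule. It assumes a heuristic that treats a period after a lone capital letter as an initial rather than a sentence end; the pattern and the name `split_keep_initials` are made up for illustration and are not taken from either online tool.

```python
import re

# A period counts as a boundary only if it does NOT follow a lone capital letter;
# ? and ! always count. The next sentence must start with a capital letter.
BOUNDARY = re.compile(r"((?<!\b[A-Z])\.|[?!])\s+(?=[A-Z])")


def split_keep_initials(text):
    """Split at assumed boundaries while keeping initials such as 'G. W.' together."""
    return BOUNDARY.sub(r"\1\n", text).split("\n")


# Initials are kept together as intended.
print(split_keep_initials("He met G. W. Bush in Texas. They talked."))

# The same rule wrongly merges two sentences when the first ends in a single capital.
print(split_keep_initials("I chose plan B. Then we left."))
```

The first call returns two sentences with the name intact. The second returns a single unsplit string, because "B." looks like an initial to the rule. Whichever way the rule is tuned, some texts end up on the wrong side of it, so manual checking is still needed.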
 