How to arrange POS-tagged text by paragraph

刘语料

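Banned user
After POS-tagging the text with TOSCA-LOB, I used the format conversion program provided by Dr. Xiao to convert the output from column (one token per line) format to row (running text) format, and got the following text: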
<p>
Prince_NPT Harry_NP 's_GENM New_NP Year_NP vow_NN :_SCOL no_ATI smoking_VBG ._SPER
</p><p>
Prince_NPT Harry_NP ,_SCOM 22_CD ,_SCOM the_ATI third_OD in_IN line_NN for_IN the_ATI British_JNP throne_NN has_HVZ now_RN vowed_VBD to_TO quit_VB smoking_VBG in_IN the_ATI New_NP Year_NP ,_SCOM ahead_RB of_IN a_AT sweeping_VBG Army_NN ban_NN ,_SCOM the_ATI Daily_JJ Mail_NN reported_VBN on_IN Tuesday_NR ._SPER

</p><p>
He_PP3A has_HVZ smoked_JJ up_RP to_TO 20_CD Marlboro_NP Lights_NNS a_AT day_NN ever_RB since_CS ,_SCOM even_CS-1 though_CS-2 his_PPG year-long_NN army_NN training_NN course_NN at_IN Sandhurst_NP ,_SCOM according_IN-1 to_IN-2 the_ATI report_NN ._SPER

</p><p>
Cadets_NNS at_IN the_ATI Royal_NPT Military_JJ Academy_NP are_BER not_XNOT allowed_VBN to_TO smoke_VB inside_IN the_ATI college_NN but_CC are_BER permitted_VBN to_TO do_DO so_QL in_IN their_PPG free_JJ time_NN ._SPER

</p><p>
Harry_NP had_HVD earlier_JJR insisted_VBD that_CS his_PPG smoking_VBG habit_NN did_DOD not_XNOT interfere_VB with_IN his_PPG health_NN of_IN fitness_NN ._SPER


But_CC he_PP3A has_HVZ finally_RB made_VBN a_AT New_NP Year_NP 's_GENM resolution_NN to_TO give_VB it_PP3 up_RP for_RB-1 good_RB-2 ._SPER

</p><p>
His_PPG decision_NN comes_VBZ as_CS the_ATI Ministry_NN of_IN Defense_NP prepares_VBZ to_TO ban_VB smoking_VBG at_IN all_ABN army_NN barracks_NNS from_IN March_NR this_DT year_NN ._SPER


Harry_NP wants_VBZ to_TO cut_VBN down_RP gradually_RB before_CS the_ATI new_JJ regulations_NNS come_VB in_IN ,_SCOM according_IN-1 to_IN-2 reports_NNS ._SPER

</p><p>
His_PPG decision_NN is_BEZ likely_JJ to_TO hearten_VB his_PPG father_NN ,_SCOM Prince_NPT Charles_NP ,_SCOM who_WPR loathes_VBZ his_PPG son_NN 's_GENM nicotine_NN habit_NN ._SPER

</p><p>
Prince_NPT Charles_NP '_GENM wife_NN ,_SCOM Camilla_NP ,_SCOM the_ATI Duchess_NPT of_IN Cornwall_NP ,_SCOM was_BEDZ herself_PPL a_AT heavy_JJ smoker_NN ,_SCOM but_CC is_BEZ believed_VBD to_TO have_HV given_VBN it_PP3 up_RP in_IN recent_JJ years_NNS ._SPER

</p><p>
However_RB ,_SCOM unfortunately_RB for_IN Harry_NP ,_SCOM whose_WPGR girlfriend_NN ,_SCOM Chelsy_NP Davy_NP is_BEZ a_AT social_JJ smoker_NN ,_SCOM giving_VBG up_RP the_ATI smoking_VBG habit_NN could_MD be_BE a_RB-1 little_RB-2 harder_JJR ._SPER

</p>

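Could the experts advise me on how to convert this text into the following format (text organized by paragraph, one paragraph per line):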
<p> Prince_NPT Harry_NP 's_GENM New_NP Year_NP vow_NN :_SCOL no_ATI smoking_VBG ._SPER </p>
<p> Prince_NPT Harry_NP ,_SCOM 22_CD ,_SCOM the_ATI third_OD in_IN line_NN for_IN the_ATI British_JNP throne_NN has_HVZ now_RN vowed_VBD to_TO quit_VB smoking_VBG in_IN the_ATI New_NP Year_NP ,_SCOM ahead_RB of_IN a_AT sweeping_VBG Army_NN ban_NN ,_SCOM the_ATI Daily_JJ Mail_NN reported_VBN on_IN Tuesday_NR ._SPER </p>
<p> He_PP3A has_HVZ smoked_JJ up_RP to_TO 20_CD Marlboro_NP Lights_NNS a_AT day_NN ever_RB since_CS ,_SCOM even_CS-1 though_CS-2 his_PPG year-long_NN army_NN training_NN course_NN at_IN Sandhurst_NP ,_SCOM according_IN-1 to_IN-2 the_ATI report_NN ._SPER </p>
<p> Cadets_NNS at_IN the_ATI Royal_NPT Military_JJ Academy_NP are_BER not_XNOT allowed_VBN to_TO smoke_VB inside_IN the_ATI college_NN but_CC are_BER permitted_VBN to_TO do_DO so_QL in_IN their_PPG free_JJ time_NN ._SPER </p>
<p> Harry_NP had_HVD earlier_JJR insisted_VBD that_CS his_PPG smoking_VBG habit_NN did_DOD not_XNOT interfere_VB with_IN his_PPG health_NN of_IN fitness_NN ._SPER But_CC he_PP3A has_HVZ finally_RB made_VBN a_AT New_NP Year_NP 's_GENM resolution_NN to_TO give_VB it_PP3 up_RP for_RB-1 good_RB-2 ._SPER </p>
<p> His_PPG decision_NN comes_VBZ as_CS the_ATI Ministry_NN of_IN Defense_NP prepares_VBZ to_TO ban_VB smoking_VBG at_IN all_ABN army_NN barracks_NNS from_IN March_NR this_DT year_NN ._SPER Harry_NP wants_VBZ to_TO cut_VBN down_RP gradually_RB before_CS the_ATI new_JJ regulations_NNS come_VB in_IN ,_SCOM according_IN-1 to_IN-2 reports_NNS ._SPER </p>
<p> His_PPG decision_NN is_BEZ likely_JJ to_TO hearten_VB his_PPG father_NN ,_SCOM Prince_NPT Charles_NP ,_SCOM who_WPR loathes_VBZ his_PPG son_NN 's_GENM nicotine_NN habit_NN ._SPER </p>
<p> Prince_NPT Charles_NP '_GENM wife_NN ,_SCOM Camilla_NP ,_SCOM the_ATI Duchess_NPT of_IN Cornwall_NP ,_SCOM was_BEDZ herself_PPL a_AT heavy_JJ smoker_NN ,_SCOM but_CC is_BEZ believed_VBD to_TO have_HV given_VBN it_PP3 up_RP in_IN recent_JJ years_NNS ._SPER </p>
<p> However_RB ,_SCOM unfortunately_RB for_IN Harry_NP ,_SCOM whose_WPGR girlfriend_NN ,_SCOM Chelsy_NP Davy_NP is_BEZ a_AT social_JJ smoker_NN ,_SCOM giving_VBG up_RP the_ATI smoking_VBG habit_NN could_MD be_BE a_RB-1 little_RB-2 harder_JJR ._SPER </p>

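I produced the format above by hand, one paragraph at a time. Could an expert help me write a program to do this, so that I can work more efficiently?
Thanks!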
 
Re: How to arrange POS-tagged text by paragraph

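Paragraph markup is not hard, and it should be possible without any programming. I recall this question was discussed on the forum before. One method is to use search and replace in Word, entering ^p (which stands for the paragraph mark, i.e. the Enter key) together with <p> and </p> to insert the paragraph tags. Another method is to use EditPlus (other text editors work too): ^ stands for the start of a line (in practice, the start of a paragraph) and is replaced with <p>; $ stands for the end of a line (in practice, the end of a paragraph) and is replaced with </p>. For multiple files, choose "all open files" to process them in one batch. Note that the POS-tagged text should have clear paragraph boundaries, and any extra spaces should be removed first. I find EditPlus much more convenient.
Marking sentences, on the other hand, is more troublesome.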
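
For anyone who would rather script this than use an editor, the ^ and $ replacement described above can be sketched in Python roughly as follows. This is only an illustration, assuming each paragraph already sits on its own line; the function name wrap_paragraphs is made up for the example and does not come from any tool mentioned in this thread.

```python
import re


def wrap_paragraphs(text: str) -> str:
    """Wrap every non-empty line in <p> ... </p>, one paragraph per line."""
    # Collapse runs of spaces/tabs and drop blank lines first, because the
    # replacement assumes clean paragraph boundaries.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    cleaned = "\n".join(line for line in lines if line)
    # With re.MULTILINE, ^ and $ match at the start and end of every line,
    # which is the same idea as the editor's "^ -> <p>" and "$ -> </p>".
    return re.sub(r"^(.+)$", r"<p> \1 </p>", cleaned, flags=re.MULTILINE)


if __name__ == "__main__":
    sample = "He_PP3A has_HVZ smoked_JJ ._SPER\n\nHarry_NP wants_VBZ to_TO quit_VB ._SPER\n"
    print(wrap_paragraphs(sample))
```

Doing both replacements in a single pattern avoids stray tags on empty lines, which the two-step editor approach can produce if blank lines are left in the file.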
 
Re: How to arrange POS-tagged text by paragraph

Thanks for the pointers, Oscar3. I would like to use whichever method is most convenient.
 
Re: How to arrange POS-tagged text by paragraph

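One thing I forgot: the find-and-replace must be run with the Regular Expression option enabled. I expect everyone is already familiar with this, so the reminder is probably unnecessary.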
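
The same regular-expression idea can also handle the exact case in the original post, where the <p> and </p> tags are already present but each paragraph is spread over several lines with blank lines in between. The following is a minimal sketch under that assumption, not a tested tool; the function name one_paragraph_per_line is hypothetical.

```python
import re


def one_paragraph_per_line(text: str) -> str:
    """Rewrite each <p> ... </p> block so that it occupies a single line."""
    # re.DOTALL lets "." cross line breaks, so each match covers a whole
    # <p> ... </p> block even when the block spans several lines.
    blocks = re.findall(r"<p>(.*?)</p>", text, flags=re.DOTALL)
    # str.split() with no argument splits on any run of whitespace, which
    # joins the sentences of a paragraph and discards the blank lines.
    return "\n".join(f"<p> {' '.join(block.split())} </p>" for block in blocks)


if __name__ == "__main__":
    sample = (
        "<p>\nPrince_NPT Harry_NP ._SPER\n</p><p>\n"
        "Harry_NP had_HVD insisted_VBD ._SPER\n\n\n"
        "But_CC he_PP3A has_HVZ ._SPER\n\n</p>\n"
    )
    print(one_paragraph_per_line(sample))
```

Text outside any <p> ... </p> pair is dropped by this sketch, so it is only suitable for files where every paragraph is already tagged.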
 