[Newbie question] How do I strip the words and keep only the tags?

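Hello everyone!
I am new to corpus-based research on translated texts, and I want to find the most frequent part-of-speech sequences in a translation. I have been told that I can use Perl to delete the words and keep only the tags, then search for N-grams in AntConc and analyse the results.

I found an earlier discussion of this problem on the forum:
http://www.corpus4u.org/forum/showthread.php?t=5219
Unfortunately, I tried it and could not get it to work.

The text I need to process is pasted below: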
Click_VVB ,_, click_VVB ,_, for_AV021 ever_AV022 click_VVB ,_, click_VVB ;_;
Mulan_NP0 sits_VVZ at_PRP the_AT0 door_NN1 and_CJC weaves_NN2 ._.
Listen_VVB ,_, and_CJC you_PNP will_VM0 not_XX0 hear_VVI the_AT0 shuttle_NN1
's_POS sound_NN1 ,_, But_CJC only_AV0 hear_VVB a_AT0 girl_NN1 's_POS sobs_NN2
and_CJC sighs_VVZ ._.
'Oh_NN1 ,_, tell_VVB me_PNP ,_, lady_NN1 ,_, are_VBB you_PNP thinking_VVG
of_PRF your_DPS love_NN1 ,_, 'Oh_NN1 tell_VVB me_PNP ,_, lady_NN1 ,_, are_VBB
you_PNP longing_VVG for_PRP your_DPS dear_NN1 ?_? '_POS 'Oh_NN1 no_ITJ ,_,
oh_ITJ no_ITJ ,_, I_PNP am_VBB not_XX0 thinking_VVG of_PRF my_DPS love_NN1 ,_,
Oh_ITJ no_ITJ ,_, oh_ITJ no_ITJ ,_, I_PNP am_VBB not_XX0 longing_VVG for_PRP
my_DPS dear_NN1 ._.
But_CJC last_ORD night_NN1 I_PNP read_VVB the_AT0 battle-roll_NN1 ;_; The_AT0
Khan_NP0 has_VHZ ordered_VVN a_AT0 great_AJ0 levy_NN1 of_PRF men_NN2 ._.
The_AT0 battle-roll_NN1 was_VBD written_VVN in_PRP twelve_CRD books_NN2 ;_;
And_CJC in_PRP each_DT0 book_NN1 stood_VVD my_DPS father_NN1 's_POS name_NN1
._.
My_DPS father_NN1 's_POS sons_NN2 are_VBB not_XX0 grown_VVN men_NN2 ._.
And_CJC of_PRF all_DT0 my_DPS brothers_NN2 ,_, none_PNI is_VBZ older_AJC
than_CJS me_PNP ._.
Oh_ITJ let_VVB me_PNP to_PRP the_AT0 market_NN1 to_TO0 buy_VVI saddle_NN1
and_CJC horse_NN1 ,_, And_CJC ride_VVI with_PRP the_AT0 soldiers_NN2 to_TO0
take_VVI my_DPS father_NN1 's_POS place_NN1 ._. '_"
In_PRP the_AT0 eastern_AJ0 market_NN1 she_PNP 's_VBZ bought_VVN a_AT0
gallant_AJ0 horse_NN1 ,_, In_PRP the_AT0 western_AJ0 market_NN1 she_PNP 's_VBZ
bought_VVN saddle_NN1 and_CJC cloth_NN1 ,_, In_PRP the_AT0 southern_AJ0
market_NN1 she_PNP 's_VBZ bought_VVN snaffle_NN1 and_CJC reins_NN2 ,_, In_PRP
the_AT0 northern_AJ0 market_NN1 she_PNP 's_VBZ bought_VVN a_AT0 tall_AJ0
whip_NN1 ._.
In_PRP the_AT0 morning_NN1 she_PNP stole_VVD from_PRP her_DPS father_NN1
's_POS and_CJC mother_NN1 's_POS house_NN1 ;_; At_PRP night_NN1 she_PNP
was_VBD camping_VVG by_PRP the_AT0 Yellow_NP0 River_NP0 's_POS side_NN1 ,_,
She_PNP could_VM0 not_XX0 hear_VVI father_NN1 and_CJC mother_NN1 calling_VVG
to_PRP her_PNP by_PRP her_DPS name_NN1 ,_, But_CJC only_AV0 the_AT0 voice_NN1
of_PRF the_AT0 Yellow_NP0 River_NP0 as_CJS its_DPS waters_NN2 swirled_VVD
through_PRP the_AT0 night_NN1 ._.
At_PRP dawn_NN1 they_PNP left_VVD the_AT0 River_NN1 and_CJC went_VVD on_PRP
their_DPS way_NN1 ;_; At_PRP dusk_NN1 they_PNP came_VVD to_PRP the_AT0
Black_AJ0 Water_NN1 's_POS side_NN1 ._.
She_PNP could_VM0 not_XX0 hear_VVI her_DPS father_NN1 and_CJC mother_NN1
calling_VVG to_PRP her_PNP by_PRP her_DPS name_NN1 ,_, She_PNP could_VM0
only_AV0 hear_VVI the_AT0 muffled_AJ0 voices_NN2 of_PRF foreign_AJ0
horsemen_NN2 riding_VVG on_PRP the_AT0 hills_NN2 of_PRF Yen_NN0 ._.
A_AT0 thousand_CRD leagues_NN2 she_PNP tramped_VVD on_PRP the_AT0 errands_NN2
of_PRF war_NN1 ,_, Frontiers_NN2 and_CJC hills_NN2 she_PNP crossed_VVD
like_PRP a_AT0 bird_NN1 in_PRP flight_NN1 ._.
Through_PRP the_AT0 northern_AJ0 air_NN1 echoed_VVD the_AT0 watchman_NN1
's_POS tap_NN1 ;_; The_AT0 wintry_AJ0 light_NN1 gleamed_VVD on_PRP coats_NN2
of_PRF mail_NN1 ._.
The_AT0 captain_NN1 had_VHD fought_VVN a_AT0 hundred_CRD fights_NN2 ,_,
and_CJC died_VVN ._.
The_AT0 wariors_NN2 in_PRP ten_CRD years_NN2 had_VHD won_VVN their_DPS
rest_NN1 ._.
They_PNP went_VVD home_AV0 ,_, they_PNP saw_VVD the_AT0 Emperor_NN1 's_POS
face_NN1 ;_; The_AT0 Son_NN1 of_PRF Heaven_NN1 was_VBD seated_VVN in_PRP
the_AT0 Hall_NN1 of_PRF Light_NN1 ._.
The_AT0 deeds_NN2 of_PRF the_AT0 brave_AJ0 were_VBD recorded_VVN in_PRP
twelve_CRD books_NN2 ;_; In_PRP prizes_NN2 he_PNP gave_VVD a_AT0 hundred_CRD
thousand_CRD cash_NN1 ._.
Then_AV0 spoke_VVD the_AT0 Khan_NP0 and_CJC asked_VVD her_PNP what_DTQ she_PNP
would_VM0 take_VVI ._.
'Oh_NN1 ,_, Mulan_NP0 asks_VVZ not_XX0 to_TO0 be_VBI made_VVN A_AT0
Counsellor_NN1 at_PRP the_AT0 Khan_NP0 's_POS court_NN1 ;_; I_PNP only_AV0
beg_VVB for_PRP a_AT0 camel_NN1 that_CJT can_VM0 march_VVI A_AT0 thousand_CRD
leagues_NN2 a_AT0 day_NN1 ,_, To_TO0 take_VVI me_PNP back_AVP to_PRP my_DPS
home_NN1 ._. '_"
When_CJS her_DPS father_NN1 and_CJC mother_NN1 heard_VVD that_CJT she_PNP
had_VHD come_VVN ,_, They_PNP went_VVD out_AVP to_PRP the_AT0 wall_NN1 and_CJC
led_VVD her_PNP back_AVP to_PRP the_AT0 house_NN1 ._.
When_CJS her_DPS little_AJ0 sister_NN1 heard_VVD that_CJT she_PNP had_VHD
come_VVN ,_, She_PNP went_VVD to_PRP the_AT0 door_NN1 and_CJC rouged_VVD
her_DPS face_NN1 afresh_AV0 ._.
When_CJS her_DPS little_AJ0 brother_NN1 heard_VVD that_CJT his_DPS sister_NN1
had_VHD come_VVN ,_, He_PNP sharpened_VVD his_DPS knife_NN1 and_CJC darted_VVD
like_PRP a_AT0 flash_NN1 Towards_PRP the_AT0 pigs_NN2 and_CJC sheep_NN0 ._.
She_PNP opened_VVD the_AT0 gate_NN1 that_CJT leds_VVZ to_PRP the_AT0
eastern_AJ0 tower_NN1 ,_, She_PNP sat_VVD on_PRP her_DPS bed_NN1 that_CJT
stood_VVD in_PRP the_AT0 western_AJ0 tower_NN1 ._.
She_PNP cast_VVD aside_AV0 her_DPS heavy_AJ0 soldier_NN1 's_POS cloak_NN1 ,_,
And_CJC wore_VVD again_AV0 her_DPS old-time_AJ0 dress_NN1 ._.
She_PNP stood_VVD at_PRP the_AT0 window_NN1 and_CJC bound_VVD her_DPS
cloudy_AJ0 hair_NN1 ;_; She_PNP went_VVD to_PRP the_AT0 mirror_NN1 and_CJC
fastened_VVD her_DPS yellow_AJ0 combs_NN2 ._.
She_PNP left_VVD the_AT0 house_NN1 and_CJC met_VVD her_DPS messmates_NN2
in_PRP the_AT0 road_NN1 ,_, Her_DPS messmates_NN2 were_VBD startled_VVN
out_PRP21 of_PRP22 their_DPS wits_NN2 ._.
They_PNP had_VHD marched_VVN with_PRP her_PNP for_PRP twelve_CRD years_NN2
of_PRF war_NN1 And_CJC never_AV0 known_VVN that_CJT Mulan_NP0 was_VBD a_AT0
girl_NN1 ._.
For_PRP the_AT0 male_AJ0 hare_NN1 sits_VVZ with_PRP its_DPS legs_NN2
tucked_VVD in_AVP ,_, And_CJC the_AT0 female_AJ0 hare_NN1 is_VBZ known_VVN
for_PRP her_DPS bleary_AJ0 eye_NN1 ;_; But_CJC set_VVB them_PNP both_DT0
scampering_VVG side_NN1 by_PRP side_NN1 ,_, And_CJC who_PNQ so_AV0 wise_AJ0
could_VM0 tell_VVI you_PNP "_" This_DT0 is_VBZ he_PNP "_" ?_?


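This is an English translation of the Ballad of Mulan. Concretely, what should I do to strip the words and keep only the tags?

A second question: how can I tag each word with its length and count the word lengths in the text?

Finally, how should I calculate average sentence length? I used a clumsy method: I counted the ? and ! marks in a Word document to get the number of sentences, then divided the total number of tokens by the number of sentences. This is probably not very accurate. Does anyone know a better way?

Many thanks in advance to the corpus experts here!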
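
For the average-sentence-length question, the count can also be made directly on the tagged text instead of in Word. The following is only a minimal Python sketch under stated assumptions: tokens have the form word_TAG, a sentence ends at a token whose word part is ".", "?" or "!", and punctuation tokens are not counted as words. The function name `average_sentence_length` is made up for illustration.

```python
SENTENCE_END = {".", "?", "!"}  # assumption: these marks close a sentence

def average_sentence_length(tagged_text: str) -> float:
    """Mean number of word tokens per sentence in word_TAG text."""
    words = 0
    sentences = 0
    pending = 0  # words seen since the last sentence-final mark
    for token in tagged_text.split():
        word, sep, _tag = token.rpartition("_")
        if not sep:
            continue  # not a word_TAG token
        if word in SENTENCE_END:
            if pending:  # ignore repeated marks such as "? !"
                sentences += 1
                pending = 0
        elif any(ch.isalnum() for ch in word):
            words += 1
            pending += 1
    if pending:  # text that ends without a final mark
        sentences += 1
    return words / sentences if sentences else 0.0

sample = ("Mulan_NP0 sits_VVZ at_PRP the_AT0 door_NN1 and_CJC weaves_NN2 ._. "
          "Listen_VVB ,_, and_CJC you_PNP will_VM0 not_XX0 hear_VVI ?_?")
print(average_sentence_length(sample))
```

Whether a semicolon or a line break should also count as a sentence boundary in verse is a research decision; the set `SENTENCE_END` can be changed accordingly.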
 

armstrong

Senior Member
Re: [Newbie question] How do I strip the words and keep only the tags?

Use the replace command in PowerGREP: in the Search box, enter (\S+)_(\S+)
In the Replace box, enter \2
Leave everything else at the defaults.
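
For readers without PowerGREP, the same search-and-replace can be tried in a few lines of Python. This is just a sketch using the pattern given above with the standard `re` module; the function name `keep_tags` is hypothetical.

```python
import re

# Same pattern as in the PowerGREP search box: word, underscore, tag.
TOKEN = re.compile(r"(\S+)_(\S+)")

def keep_tags(text: str) -> str:
    """Replace every word_TAG token with its TAG (group 2) only."""
    return TOKEN.sub(r"\2", text)

sample = "Mulan_NP0 sits_VVZ at_PRP the_AT0 door_NN1 and_CJC weaves_NN2 ._."
print(keep_tags(sample))
```

Line breaks and spacing are left untouched, so the tag-only output can be saved as a plain text file and loaded into AntConc for the N-gram search.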

As for counting word lengths, search this site; there is free software available for it.
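
If no ready-made tool is at hand, word lengths can also be counted with a short script. The sketch below is not the software mentioned above; it assumes word_TAG input, measures length as the number of letters and digits in the word (so apostrophes and hyphens are not counted), and skips pure punctuation. All function names are made up for illustration.

```python
from collections import Counter

def word_length(word: str) -> int:
    """Length = number of letters and digits (an assumption of this sketch)."""
    return sum(ch.isalnum() for ch in word)

def count_word_lengths(tagged_text: str) -> Counter:
    """Map each word length to the number of words of that length."""
    lengths = Counter()
    for token in tagged_text.split():
        word, sep, _tag = token.rpartition("_")
        n = word_length(word)
        if sep and n:  # skip malformed tokens and pure punctuation
            lengths[n] += 1
    return lengths

def mean_word_length(lengths: Counter) -> float:
    total_words = sum(lengths.values())
    total_chars = sum(n * count for n, count in lengths.items())
    return total_chars / total_words if total_words else 0.0

def annotate_lengths(tagged_text: str) -> str:
    """Replace each POS tag with the word's length, e.g. door_NN1 -> door_4."""
    out = []
    for token in tagged_text.split():
        word, sep, _tag = token.rpartition("_")
        n = word_length(word)
        out.append(f"{word}_{n}" if sep and n else token)
    return " ".join(out)

sample = "Mulan_NP0 sits_VVZ at_PRP the_AT0 door_NN1 and_CJC weaves_NN2 ._."
lengths = count_word_lengths(sample)
print(sorted(lengths.items()))
print(mean_word_length(lengths))
print(annotate_lengths(sample))
```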
 
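Re: A question for Dr. Xu

The PowerGREP trial version can be used for one month.
http://download.jgsoft.com/powergrep/SetupPowerGREPDemo.exe
If you are not familiar with PowerGREP, any other text editor that supports regular expressions can perform the replacement armstrong describes above.
I am honoured to have answers from Armstrong and Dr. Xu; this question is now fully resolved.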

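I have one more question for Professor Xu. Corpus-based studies of translated texts currently concentrate on the lexical and phrasal level: standardised type/token ratio, lexical density, mean word length, average sentence length, high-frequency part-of-speech sequences, and so on. But is the corpus approach still not well suited to evaluating the discourse level, for example cohesion and coherence?
I have seen one scholar use a corpus to count explicit connectives such as and, however, because, and since.

Implicit connection is a connective relation expressed without conjunctions or connective phrases, mainly through the natural ordering of meanings; in other words, logical relations are expressed mainly through word order. Implicit connection is a concrete manifestation of the paratactic nature of Chinese. English is hypotactic, and this shows in its grammatical requirements: if two short clauses are joined implicitly, punctuation is needed to keep the sentence correct. The marks in question are mainly , ; ! and ?. For example, "I came; I saw; I conquered." contains no explicit connective; the connection is expressed mainly through the natural ordering of meanings, with ";" as the cohesive device. That scholar analysed discourse coherence by counting punctuation marks. Do you think this is feasible?

If we use corpora for discourse coherence analysis, what else can we do?

Sorry for the long post. I very much look forward to Professor Xu's advice!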
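
Both kinds of count mentioned in this post (explicit connectives and the punctuation marks , ; ! ?) are easy to obtain from word_TAG text. The Python sketch below only illustrates the counting step and makes no claim about whether such counts measure coherence. The word list is the one given in the post; the function name and the choice to match on lower-cased word forms are assumptions.

```python
from collections import Counter

CONNECTIVES = {"and", "however", "because", "since"}  # list from the post
PUNCTUATION = {",", ";", "!", "?"}                    # marks from the post

def count_cohesion_markers(tagged_text: str):
    """Return (connective counts, punctuation counts) for word_TAG text."""
    connectives = Counter()
    punctuation = Counter()
    for token in tagged_text.split():
        word, sep, _tag = token.rpartition("_")
        if not sep:
            continue  # not a word_TAG token
        lowered = word.lower()
        if lowered in CONNECTIVES:
            connectives[lowered] += 1
        elif word in PUNCTUATION:
            punctuation[word] += 1
    return connectives, punctuation

sample = ("Click_VVB ,_, click_VVB ,_, for_AV021 ever_AV022 click_VVB ,_, "
          "click_VVB ;_; Mulan_NP0 sits_VVZ at_PRP the_AT0 door_NN1 "
          "and_CJC weaves_NN2 ._.")
connectives, punctuation = count_cohesion_markers(sample)
print(connectives, punctuation)
```

Matching on word form alone cannot tell a temporal "since" from a causal one; filtering on the tag as well (for instance, keeping only tokens tagged as conjunctions) is one possible refinement.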
 

xujiajin

Administrator
Staff member
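Re: [Newbie question] How do I strip the words and keep only the tags?

Corpora are not well suited to studying implicit linguistic features. That said, a good deal of discourse-level research can still be done on the basis of corpora. It cannot be explained in a sentence or two, so look for books and articles on the subject. In recent years, corpus-based discourse studies have been a hot topic internationally.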
 