williamJia
Open Corpus Project
In my spare time I wrote a small program that strips the following kinds of annotation:
-------------------------------------------------------------------------------
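These are the requirements put forward by Mr. Xu of Beijing Foreign Studies University (BFSU):
1. Remove the content in the square brackets below (including the brackets themselves).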
In previous time [np6, 2-], China wanted keenly to reach communisty [fm1,-] society early. However, it have [vp3, 1-] made our national economic [wd2, 2-] develop slowly, and to some extents [wd3, 2-], it did harm to [vp1, 1-4] our national economic's [fm2,-] development. Today our government adopted the open and reform policy. Our national economic [wd2, 2-] become [vp3, 3-] apparently better than ever. The people's living conditions are much improved. The people's educational level rised. [fm2,-]
2. Remove the XML markup below.
<w POS="PPIS1">I</w> <w POS="VV0">love</w> <w POS="DD1">this</w> <w POS="NN1">game</w> <w POS=".">.</w>
Keep only: I love this game.
3. Remove the underscores and tags below.
I_PPIS1 love_VV0 this_DD1 game_NN1 ._.
Keep only: I love this game.
4. Remove the slashes and tags below.
I/PPIS1 love/VV0 this/DD1 game/NN1 ./.
Keep only: I love this game.
5. Remove the SGML markup below.
<s n="298"><pause> <shift new=reading> <w PRP>On <w AT0>the <w ORD>fifth <w PRF>of <w NP0>September <w CRD>nineteen <w CRD>thirty <w CRD>nine <w AT0>a <w NN1>control <w PRF>of <w NN1>timber <w VVD>ordered <w VBD>was <w VVN>made <w PRP>by <w AT0>the <w NN1>Ministry <w PRF>of <w NN1>Supply<c PUN>, <w VVN>followed <w PRP>by <w NN2>controls <w PRP-AVP>on <w DT0>all <w AJ0>raw <w NN2>materials <w VVN>used <w PRP>by <w AT0>the <w NN1>furniture <w NN1>industry <w CJC>and <w AJ0>allied <w NN2>trades<c PUN>.
http://www.corpus4u.org/upload/forum/2006062816440947.rar
-------------------------------------------------------------------------------
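The five requirements above can be approximated with a few regular expressions. What follows is only a minimal Python sketch, not the program in the archive linked above. The function names (`tidy`, `strip_bracket_tags`, `strip_markup`, `strip_suffix_tags`) are made up for illustration. Removing the space left in front of punctuation is an assumption taken from the "Keep only" lines, which show "game." rather than "game .".

```python
import re


def tidy(text: str) -> str:
    """Collapse runs of whitespace and drop the space left before punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([.,;:!?])", r"\1", text)


def strip_bracket_tags(text: str) -> str:
    """Requirement 1: remove [...] tags and the whitespace in front of them."""
    return re.sub(r"\s*\[[^\]]*\]", "", text)


def strip_markup(text: str) -> str:
    """Requirements 2 and 5: remove anything of the form <...>.

    This treats XML and SGML alike and does not parse them, so a literal
    '<' or '>' in the running text would confuse it.
    """
    return tidy(re.sub(r"<[^>]*>", "", text))


def strip_suffix_tags(text: str, sep: str) -> str:
    """Requirements 3 and 4: keep the part of each token before the last sep."""
    words = []
    for token in text.split():
        word, found, _tag = token.rpartition(sep)
        # Tokens without the separator are kept unchanged.
        words.append(word if found else token)
    return tidy(" ".join(words))


print(strip_bracket_tags("However, it have [vp3, 1-] made our national economic [wd2, 2-] develop slowly"))
# However, it have made our national economic develop slowly
print(strip_markup('<w POS="PPIS1">I</w> <w POS="VV0">love</w> <w POS="DD1">this</w> <w POS="NN1">game</w> <w POS=".">.</w>'))
# I love this game.
print(strip_suffix_tags("I_PPIS1 love_VV0 this_DD1 game_NN1 ._.", "_"))
# I love this game.
print(strip_suffix_tags("I/PPIS1 love/VV0 this/DD1 game/NN1 ./.", "/"))
# I love this game.
print(strip_markup('<s n="298"><pause> <shift new=reading> <w PRP>On <w AT0>the <w NN1>Ministry <w PRF>of <w NN1>Supply<c PUN>, <w VVN>followed'))
# On the Ministry of Supply, followed
```

Splitting on the last separator in each token, rather than the first, keeps words that themselves contain an underscore or slash intact, at the cost of assuming the tag never contains one.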