How do I remove the Chinese or the English from mixed Chinese-English text?

Re: How do I remove the Chinese or the English from mixed Chinese-English text?

The English text after separation (ENG_OUT.txt):

corpus AHD: k D.J.: 6k%8rp*s K.K.: 6k%rp*s n. pl. cor.po.ra; AHD: -p abbr: cor. A large collection of writings of a specific kind or on a specific subject. The principal or capital as distinguished from the interest or income as of a fund or estate. Anatomy The main part of a bodily structure or organ. A distinct bodily mass or organ having a specific function. Music The overall length of a violin. Middle English fromLatin *See Also : kw In Appendixrep- corpus AHD: k D.J.: 6k%8rp*s K.K.: 6k%rp*s n. pl. cor.po.ra; cor.po.ra; AHD: -p abbr: cor. A large collection of writings of a specific kind or on a specific subject. The principal or capital as distinguished from the interest or income as of a fund or estate. Anatomy The main part of a bodily structure or organ. A distinct bodily mass or organ having a specific function. Music The overall length of a violin. Middle English fromLatin *See Also : kw In Appendixrep- corpus 5kC: pEs n corpus 5kC: pEs n. pl. -pora -pErE corpus adiposum corpus callosum corpus delicti di5liktai corpus juris 5dVuEris corpus luteum corpus striatum actual corpus estate corpus habeas corpus 5heibjEs5kC: pEs trust corpus corpus 5kC: pEs n.
 
Re: How do I remove the Chinese or the English from mixed Chinese-English text?

Quoting xiaoz (2005-9-11 11:57:40):
Or maybe like this file: you want to "de-align" English from Chinese and save them as two separate files? And you have a lot of such files rather than just a dozen of them which you can process one by one? It's a piece of cake.

<p>
<s n="L1E_0001"> The_AT Future_NN1 of_IO Africa_NP1 </s>
<s n="L2C_0001"> 非洲_ns 的_u 未来_t </s>
</p>

Try PowerGREP with a regex search-and-replace (replace each match with nothing).
For example, to delete the English:
(<s n="L1E([^<]+)</s>)

and to delete Chinese:
(<s n="L2C([^<]+)</s>)

Then save each result as a separate file, one Chinese and one English.
The regex for other annotated texts will vary depending on how the text is marked up, but PowerGREP can do all the deletion in a second.
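
For readers without PowerGREP, the same two patterns can be tried in Python. This is only a minimal sketch, assuming the markup looks exactly like the sample above (one `<s>` element per line, `L1E` for English and `L2C` for Chinese); the names `SAMPLE` and `split_aligned` are made up for illustration.

```python
import re

# Hypothetical sample in the same markup as the excerpt quoted above.
SAMPLE = (
    '<p>\n'
    '<s n="L1E_0001"> The_AT Future_NN1 of_IO Africa_NP1 </s>\n'
    '<s n="L2C_0001"> 非洲_ns 的_u 未来_t </s>\n'
    '</p>\n'
)

# Same idea as the PowerGREP patterns; the optional \n also removes the
# line break so no empty line is left behind.
ENGLISH_S = re.compile(r'<s n="L1E[^<]+</s>\n?')
CHINESE_S = re.compile(r'<s n="L2C[^<]+</s>\n?')


def split_aligned(text):
    """Return (english_only, chinese_only) copies of an aligned text."""
    english_only = CHINESE_S.sub('', text)  # delete the Chinese sentences
    chinese_only = ENGLISH_S.sub('', text)  # delete the English sentences
    return english_only, chinese_only


english_only, chinese_only = split_aligned(SAMPLE)
print(english_only)
print(chinese_only)
```

Each result can then be written to its own file. Note that `[^<]+` stops at the first `<`, so a sentence containing nested tags would not match and would need a different pattern.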
 
Re: Re: How do I remove the Chinese or the English from mixed Chinese-English text?

It may take a little more than a simple RegExp. Here is a NoteTab clip (BilingualExtractor) I wrote. It will keep your original file intact, separate the Chinese portion of the text from the English portion, and finally output each as CHIN_OUT.txt and ENG_OUT.txt, respectively.
Usage:
1) Save the clip to ..\NoteTab Light\Libraries;
2) POS tag your bilingual text with ICTCLAS (you can find it within ACWT; for the settings this step requires, see the earlier post and screenshot on page 1);
3) Open the POS-tagged file (xxx.cla.txt) in NoteTab Light;
4) Find and apply BilingualExtractor (the command title is C-E Extractor);
5) Hope it works okay.

It will not work if (among other things):
- The file is not a clean ASCII text file;
- The English portion of the text is not clean ASCII (e.g. it uses the Chinese look-alike versions of English letters);
- Your text was not processed in the right way by ICTCLAS.

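Whenever I follow the method above to separate mixed Chinese-English text, I get error messages like the ones attached and cannot go any further. Could someone help me work out the cause? Also, what kind of text document counts as a clean ASCII text file, and how can I get one? Thanks, everyone!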
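
One possible reading of "clean ASCII" here is that the English letters and digits must be ordinary half-width characters rather than their full-width look-alikes (a file containing Chinese cannot be pure ASCII as a whole). Under that assumption, the following sketch finds full-width letters and digits and converts them to ASCII while leaving Chinese characters and Chinese punctuation alone. The function names are made up for illustration, and this is not necessarily the check the NoteTab clip performs.

```python
def build_fullwidth_table():
    """Map full-width digits and Latin letters to their ASCII equivalents."""
    table = {}
    # Full-width 0-9, A-Z and a-z sit exactly 0xFEE0 above their ASCII forms.
    for start, end in ((0xFF10, 0xFF19), (0xFF21, 0xFF3A), (0xFF41, 0xFF5A)):
        for code_point in range(start, end + 1):
            table[code_point] = code_point - 0xFEE0
    return table


FULLWIDTH_TABLE = build_fullwidth_table()


def find_fullwidth(text):
    """Return (position, character) pairs for full-width letters and digits."""
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) in FULLWIDTH_TABLE]


def normalize_fullwidth(text):
    """Replace full-width letters and digits with ASCII; keep everything else."""
    return text.translate(FULLWIDTH_TABLE)


sample = 'ｃｏｒｐｕｓ 语料库，corpus'  # hypothetical input line
print(find_fullwidth(sample))
print(normalize_fullwidth(sample))
```

Only letters and digits are converted on purpose: `unicodedata.normalize('NFKC', text)` would also fold full-width punctuation such as `，` into ASCII, which changes the Chinese side of the text.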
 

Attachments

  • 1.jpg (7.2 KB)
  • 2.jpg (7.6 KB)
Re: How do I remove the Chinese or the English from mixed Chinese-English text?

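In EditPlus I could not find a regular expression that removes the punctuation in the English without also deleting the punctuation in the Chinese. Perhaps TextPro can remove the Chinese or the English, together with the punctuation in between, using regular expressions alone and without deleting the wrong things. Word's macro feature can likewise only remove the body text; it cannot guarantee that all of the English punctuation gets deleted. I made a macro that removes the English: paste the text in and press ALT+C. After that the rest has to be done by hand. It would be great if someone could improve it.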
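
In a Unicode-aware regex engine the punctuation problem may be easier than it is in EditPlus, because ASCII punctuation and Chinese punctuation are different code points. The sketch below is an illustration rather than a port of the Word macro: it assumes the English uses only ASCII letters and punctuation, and `strip_english` is a made-up name.

```python
import re
import string

# ASCII letters plus ASCII punctuation. Chinese punctuation such as "，"
# (U+FF0C) and "。" (U+3002) is outside this class and is left untouched.
ASCII_LETTERS_AND_PUNCT = re.compile(
    '[A-Za-z' + re.escape(string.punctuation) + ']+'
)


def strip_english(text):
    """Delete ASCII letters and punctuation, then tidy the leftover spaces."""
    without_english = ASCII_LETTERS_AND_PUNCT.sub(' ', text)
    return re.sub(r'[ \t]+', ' ', without_english).strip()


mixed = '语料库 (corpus)，是指大量文本。Hello, world!'  # hypothetical input
chinese_part = strip_english(mixed)
print(chinese_part)
```

ASCII digits are kept deliberately, since numbers can belong to either language; add `0-9` to the character class if they should go as well.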
 

Attachment

  • test.doc (54.5 KB)
Re: How do I remove the Chinese or the English from mixed Chinese-English text?

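Many thanks to hittle2008 for the help! The macro you wrote works very well!
Sorry, I did not state my question clearly enough. What I really want is two texts: one purely Chinese and one purely English. So the title should perhaps be changed to "How do I extract the Chinese and the English from mixed Chinese-English text?"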
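
For plain (untagged) mixed text, one way to get both outputs in a single pass is to collect runs of characters by Unicode range. This is a rough sketch under stated assumptions: Chinese is taken to be the common Han block plus CJK and full-width punctuation, English is taken to be printable ASCII, and `separate` is a made-up name. It is not the method used by the NoteTab clip.

```python
import re

# Runs of Han characters, CJK punctuation and full-width forms.
CHINESE_RUN = re.compile(r'[\u3000-\u303f\u4e00-\u9fff\uff00-\uffef]+')
# Runs of printable ASCII words separated by single spaces.
ENGLISH_RUN = re.compile(r'[\x21-\x7e]+(?: [\x21-\x7e]+)*')


def separate(text):
    """Return (chinese, english), each with its runs joined by single spaces."""
    chinese = ' '.join(CHINESE_RUN.findall(text))
    english = ' '.join(ENGLISH_RUN.findall(text))
    return chinese, english


mixed = 'corpus 语料库 A large collection of writings. 大量文本的集合。'
chinese, english = separate(mixed)  # hypothetical input line
print(chinese)
print(english)
```

The two strings could then be written to files such as CHIN_OUT.txt and ENG_OUT.txt with `open(name, 'w', encoding='utf-8')`. ASCII digits and symbols embedded in Chinese sentences end up on the English side, so real data will probably need extra rules.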
 