How do I remove parentheses together with the English inside them, and parentheses together with the Chinese inside them?

刘语料

As the title says: could someone please help me turn text 1 into text 2, and text 3 into text 4?

1​

<s id=10016> “工资”(wages) 具有《雇佣条例》(第57章)第2(1)条给予该词的涵义;
<s id=10017> “扣押令”(attachment order) 指根据《未成年人监护条例》(第13章)第20条、《分居令及赡养令条例》(第16章)第9A条或《婚姻法律程序与财产条例》(第192章)第28条作出的扣押入息令,并包括根据第9条作出的扣押入息令的更改;
<s id=10018> “有关赡养令”(related maintenance order) 指《未成年人监护条例》(第13章)第20(2)条、《分居令及赡养令条例》(第16章)第9A(2)条或《婚姻法律程序与财产条例》(第192章)第28(2)条(视何者适当而定)所指明的命令,而该命令的强制执行属根据本规则提出的申请的标的;
<s id=10019> “指定受款人”(designated payee) 指法院在有关赡养令中所指名的受款人;
<s id=10020> “经济能力陈述书”(statement of means) 指赡养费支付人的入息与开支陈述书,其格式由终审法院首席法官订明;

2​

<s id=10016> “工资”具有《雇佣条例》(第57章)第2(1)条给予该词的涵义;
<s id=10017> “扣押令”指根据《未成年人监护条例》(第13章)第20条、《分居令及赡养令条例》(第16章)第9A条或《婚姻法律程序与财产条例》(第192章)第28条作出的扣押入息令,并包括根据第9条作出的扣押入息令的更改;
<s id=10018> “有关赡养令” 指《未成年人监护条例》(第13章)第20(2)条、《分居令及赡养令条例》(第16章)第9A(2)条或《婚姻法律程序与财产条例》(第192章)第28(2)条(视何者适当而定)所指明的命令,而该命令的强制执行属根据本规则提出的申请的标的;
<s id=10019> “指定受款人” 指法院在有关赡养令中所指名的受款人;
<s id=10020> “经济能力陈述书” 指赡养费支付人的入息与开支陈述书,其格式由终审法院首席法官订明;


3​

<s id=25> "act" (作为), when used with reference to an offence or civil wrong, includes a series of acts, an illegal omission and a series of illegal omissions;
<s id=26> "Administrative Appeals Board" (行政上诉委员会) means the Administrative Appeals Board established under the Administrative Appeals Board Ordinance (Cap 442); (Added 6 of 1994 s. 32)
<s id=27> "adult" (成人、成年人)* means a person who has attained the age of 18 years; (Amended 32 of 1990 s. 6)


4​

<s id=25> "act" , when used with reference to an offence or civil wrong, includes a series of acts, an illegal omission and a series of illegal omissions;
<s id=26> "Administrative Appeals Board" means the Administrative Appeals Board established under the Administrative Appeals Board Ordinance (Cap 442); (Added 6 of 1994 s. 32)
<s id=27> "adult" means a person who has attained the age of 18 years; (Amended 32 of 1990 s. 6)

Thanks!
 

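Solving this completely looks a bit difficult. There is a tool called TextPro that can strip half-width and full-width characters separately. Unfortunately your text has already been annotated: text 1 contains half-width markup symbols as well as half-width characters in the text itself. Text 3, on the other hand, converts to text 4 without trouble. If you had done this step before annotating text 1, it would have worked there too.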
 


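I came here to ask precisely because TextPro did not solve the problem.

When converting text 3 into text 4 I can remove the Chinese characters, but the parentheses stay behind, and if I remove the parentheses, all the other parentheses go with them.

Thanks, Oscar3.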
 
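Re: How do I remove parentheses together with the English inside them, and parentheses together with the Chinese inside them?

One idea: check each character's code value. English letters fall inside the ASCII range and Chinese characters fall outside it, so the two can be told apart.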
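The code-value idea can be sketched in Python. This is only a sketch under stated assumptions: the helper names `classify` and `strip_groups` are made up for illustration, "Chinese" is approximated by the CJK Unified Ideographs block (U+4E00 to U+9FFF), and the parentheses are assumed to be half-width and not nested.

```python
import re

# One half-width parenthesized group with no nested parentheses.
PAREN_GROUP = re.compile(r"\(([^()]*)\)")

def is_cjk(ch: str) -> bool:
    """True for characters in the CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def classify(content: str) -> str:
    """Label parenthesized content as 'chinese', 'english' or 'other'."""
    if any(is_cjk(ch) for ch in content):
        return "chinese"
    if any(ch.isascii() and ch.isalpha() for ch in content):
        return "english"
    return "other"  # e.g. "(1)": digits only, left alone

def strip_groups(line: str, unwanted: str) -> str:
    """Delete the groups whose content is classified as `unwanted`."""
    def repl(match):
        return "" if classify(match.group(1)) == unwanted else match.group(0)
    return PAREN_GROUP.sub(repl, line)

print(strip_groups('“工资”(wages) 具有《雇佣条例》(第57章)第2(1)条', "english"))
print(strip_groups('"act" (作为), when used (Cap 442)', "chinese"))
```

Because each group is judged by what it contains, groups such as (第57章) and (1) in the Chinese text are kept while (wages) is dropped.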
 


I tried this approach with Word's Find and Replace, and the results were not satisfactory.
 
Re: How do I remove parentheses together with the English inside them, and parentheses together with the Chinese inside them?

I would assume the English and Chinese texts are stored in separate files (they can be separated easily, anyway). You can then use regular expressions (in Perl, Python etc.) to remove the unwanted elements in the two texts.

Removing English in parentheses in the Chinese text:

# Removes parenthesized runs made up of the 26 letters (upper and lower case), white space and hyphens.
# The English glosses in the Chinese texts appear to consist of only these characters.

$line=~s/\([A-Za-z -]*?\)//g;

Removing Chinese in parentheses in the English text:

# Removes parenthesized runs that contain no letters (upper or lower case), white space, hyphens or dots.
# I am not quite sure if this one really works; you may need to have someone test it for you.

$line=~s/\([^a-zA-Z \.\-]*?\)//g;
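For readers who prefer Python, the same two substitutions might look like the sketch below, with the character classes written out in square brackets. The function names are hypothetical, the sample lines are shortened from the texts in the question, and `+` is used instead of `*?` so that empty parentheses are left untouched, which is an assumption about what is wanted.

```python
import re

# English gloss in the Chinese text: only letters, spaces and hyphens inside ( ).
ENGLISH_GLOSS = re.compile(r"\([A-Za-z -]+\)")
# Chinese gloss in the English text: no letters, spaces, dots, hyphens or
# further parentheses inside ( ).
CHINESE_GLOSS = re.compile(r"\([^A-Za-z .\-()]+\)")

def strip_english_gloss(line: str) -> str:
    return ENGLISH_GLOSS.sub("", line)

def strip_chinese_gloss(line: str) -> str:
    return CHINESE_GLOSS.sub("", line)

chinese_line = '<s id=10019> “指定受款人”(designated payee) 指法院'
english_line = '<s id=26> "Administrative Appeals Board" (行政上诉委员会) means the Board (Cap 442);'
print(strip_english_gloss(chinese_line))
print(strip_chinese_gloss(english_line))
```

Note that (Cap 442) survives because it contains letters, and (第57章) in the Chinese text survives because it contains characters outside the English class.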
 


Dr. Xiao, could you turn the code above into a .pl file?
Thanks!
 
Re: How do I remove parentheses together with the English inside them, and parentheses together with the Chinese inside them?

I cannot because I don't know your file structure and format. If you send me your texts in a zipped archive, I will have a look for you.
 


OK, here are two samples.
 

Attachments

  • c.txt
    764 bytes
  • e.txt
    469 bytes

Re: How do I remove parentheses together with the English inside them, and parentheses together with the Chinese inside them?

Removing English in parentheses in the Chinese text:
$line=~s/\([A-Za-z -]*?\)//g;

Removing Chinese in parentheses in the English text:
$line=~s/\([^a-zA-Z \.\-]*?\)//g;

Clever hack! This is where the power of regular expressions shines.
 
Re: How do I remove parentheses together with the English inside them, and parentheses together with the Chinese inside them?

You can save the following lines as format.pl and put this file in the same folder where your corpus files are stored (assuming you have installed Perl, your Chinese files have filenames ending in c.txt and your English files have filenames ending in e.txt, case-insensitive):

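---BEGIN---
use strict;
use warnings;

# Collect every .txt file in the current directory, skipping output from earlier runs.
opendir(my $dir, ".") or die "Could not open the current directory: $!";
my @files = grep { /\.txt$/i && !/^new_/ } readdir($dir);
closedir($dir);

foreach my $file (sort @files)
{
    print "Processing $file\n";
    # The output keeps the original name with a "new_" prefix, e.g. new_c.txt.
    my $output = "new_" . $file;
    open(my $in, "<:encoding(UTF-8)", $file) or die "Could not open $file: $!";
    open(my $out, ">:encoding(UTF-8)", $output) or die "Could not write to $output: $!";
    while (my $line = <$in>)
    {
        if ($file =~ /c\.txt$/i)
        {
            # Chinese file: drop English glosses made of letters, spaces, dots and hyphens.
            $line =~ s/\([A-Za-z .-]*?\)//g;
        }
        elsif ($file =~ /e\.txt$/i)
        {
            # English file: drop parenthesized runs that contain no white space.
            $line =~ s/\(\S*?\)//g;
        }
        # Lines from any other .txt file are copied unchanged.
        print $out $line;
    }
    close($in);
    close($out);
}
---END---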

Double-click format.pl, and for each input file you will get an output file whose name starts with "new_".
That's it. Cheers.
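For anyone without Perl installed, roughly the same batch job could be written in Python. The sketch below assumes UTF-8 files and reuses the script's two substitution rules; `clean_line` and `process_folder` are hypothetical names, and unlike the Perl version the `*.txt` glob is case-sensitive on some systems.

```python
import re
from pathlib import Path

ENGLISH_GLOSS = re.compile(r"\([A-Za-z .-]*?\)")  # rule for files ending in c.txt
NO_SPACE_GROUP = re.compile(r"\(\S*?\)")          # rule for files ending in e.txt

def clean_line(filename: str, line: str) -> str:
    """Apply the rule that matches the file's suffix; otherwise copy the line."""
    name = filename.lower()
    if name.endswith("c.txt"):
        return ENGLISH_GLOSS.sub("", line)
    if name.endswith("e.txt"):
        return NO_SPACE_GROUP.sub("", line)
    return line

def process_folder(folder: Path) -> list:
    """Write new_<name> next to every .txt file and return the output paths."""
    written = []
    for src in sorted(folder.glob("*.txt")):
        if src.name.startswith("new_"):
            continue  # skip output left over from an earlier run
        dst = src.with_name("new_" + src.name)
        with src.open(encoding="utf-8") as fin, dst.open("w", encoding="utf-8") as fout:
            for line in fin:
                fout.write(clean_line(src.name, line))
        written.append(dst)
    return written
```

Calling `process_folder(Path("."))` from the corpus folder would process the files in place, leaving the originals untouched.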
 


Thanks, Dr. Xiao.
I did as you told me, and it does a fantastic job.
So I'll learn some programming.
 
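Re: How do I remove parentheses together with the English inside them, and parentheses together with the Chinese inside them?

If the two texts are in separate files, people who do not program can use regular expressions in a find-and-replace dialog. For example, to remove the English in parentheses from the first document,
replace “\([a-zA-Z\s]*\)” with nothing.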
 
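Re: How do I remove parentheses together with the English inside them, and parentheses together with the Chinese inside them?

To remove the Chinese in parentheses, “\(([^a-zA-Z]*?)\)” will do. The quotation marks are not part of either expression.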
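One caveat is worth checking before running a pattern like this over a whole corpus: a class that only excludes letters also matches purely numeric groups such as (1). The Python sketch below compares the pattern with a stricter variant that requires at least one CJK ideograph; the stricter pattern and the sample sentence are illustrations, not something proposed in this thread.

```python
import re

# Pattern from the post above, without the unneeded capturing group.
LOOSE = re.compile(r"\([^a-zA-Z]*?\)")
# Hypothetical stricter variant: at least one character from U+4E00 to U+9FFF.
STRICT = re.compile(r"\([^a-zA-Z()]*[\u4e00-\u9fff][^a-zA-Z()]*\)")

sample = 'section 2(1) defines "adult" (成人、成年人)'
loose_result = LOOSE.sub("", sample)    # also removes "(1)"
strict_result = STRICT.sub("", sample)  # removes only the Chinese gloss
print(loose_result)
print(strict_result)
```

If the English text contains numbered subsections, the stricter form is probably the safer choice.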
 