李亮1975重庆
语料库快乐军政委
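【Usage Guide】
[1] thinner.py is a Python script built on NLTK. It must be copied into the folder where NLTK is installed (the "NLTK folder" below).
[2] Inside the NLTK folder, manually create a folder named "target". The script only processes the txt files in the "target" folder.
[3] Copy all the files and folders that make up your corpus into the "target" folder. The NLTK folder has no "target" folder to begin with; it is the one you just created.
[4] Once the script and the corpus are in place, double-click python.exe in the NLTK folder (you may not see the ".exe" part of the name, which is normal). python.exe opens a window with white text on a black background, where you type the following:
import thinner; thinner.start();
Press ENTER after typing the line above, and you will see the TXT files being processed one by one, including every TXT file in sub-folders, sub-sub-folders, and so on, however deep they go.
[5] Because of how NLTK itself works, the inflected comparative and superlative forms of adjectives and adverbs are not reduced to their base form but are left unchanged. In other words, greater stays greater and greatest stays greatest.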
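On point [5]: one likely reason for this behaviour is that `WordNetLemmatizer.lemmatize()` treats every word as a noun unless a part of speech is passed in. The sketch below is not part of thinner.py. It is a minimal illustration, assuming you are willing to try each WordNet part of speech in turn and keep the shortest result. `shortest_lemma` and `fake_lemmatize` are hypothetical names; `fake_lemmatize` is only a tiny stand-in lookup table so that the example runs without NLTK or its data.

```python
# Hypothetical helper, not part of thinner.py.
WORDNET_POS_TAGS = ("n", "v", "a", "r")  # noun, verb, adjective, adverb


def shortest_lemma(word, lemmatize, pos_tags=WORDNET_POS_TAGS):
    """Call lemmatize(word, pos) once per POS tag and keep the shortest result."""
    candidates = [lemmatize(word, pos) for pos in pos_tags]
    return min(candidates, key=len)


# Stand-in data (an assumption for this demo only): a few (word, pos) pairs
# mapped to the base form a real lemmatizer would be expected to return.
_FAKE_TABLE = {
    ("greater", "a"): "great",
    ("greatest", "a"): "great",
    ("files", "n"): "file",
}


def fake_lemmatize(word, pos):
    # Like a real lemmatizer, fall back to the word itself when nothing matches.
    return _FAKE_TABLE.get((word, pos), word)


for token in ["greater", "greatest", "files", "target"]:
    print(token, "->", shortest_lemma(token, fake_lemmatize))
```

With NLTK and its WordNet data installed, the real call would look like shortest_lemma("greater", nltk.stem.WordNetLemmatizer().lemmatize). Trying every part of speech can also over-reduce some words, so check a sample of your corpus before relying on it.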
【Download】 On the share page of my Baidu Netdisk:
http://pan.baidu.com/share/home?uk=724520607&view=share#category/type=0
【Source Code】
Code:
# This software was developed by Li Liang at GDUFS (Guangdong University of Foreign Studies).
# Usage: this script must be named thinner.py.
# Step 1: copy this script to the NLTK folder.
# Step 2: create a folder named "target" there, and copy your corpus folder into it.
# Step 3: double-click python.exe, enter "import thinner; thinner.start();", and press ENTER.
# Those three steps are all you need.
# The comparative and superlative forms of adjectives and adverbs are left unchanged due to NLTK's functionality.
import os
import nltk
def get_files(newpaths):
    # newpaths is a list
    # get the files directly inside one or more folders
    result = []
    for newpath in newpaths:
        for tmp in os.listdir(newpath):
            fullpath = os.path.join(newpath, tmp)
            if os.path.isfile(fullpath):
                result.append(fullpath)
    return result
def get_subfolders(newpaths):
    # newpaths is a list
    # get the direct subfolders of one or more folders
    result = []
    for newpath in newpaths:
        for tmp in os.listdir(newpath):
            fullpath = os.path.join(newpath, tmp)
            if os.path.isdir(fullpath):
                result.append(fullpath)
    return result
def get_recursive_subfolders(newpath):
    # newpath is a string
    # collect subfolders level by level until a level has none
    result = []
    mypath = [newpath]
    while True:
        tmp = get_subfolders(mypath)
        result = result + tmp
        if len(tmp) == 0:
            break
        mypath = tmp
    return result
def get_recursive_files(newpath):
    # get all files in the folder and in every subfolder below it
    result = get_files([newpath])
    result = result + get_files(get_recursive_subfolders(newpath))
    return result
def get_recursive_txtfiles(newpath):
    # keep only the files whose name ends with ".txt" (case-insensitive)
    result = []
    for tmp in get_recursive_files(newpath):
        tmppath, tmpname = os.path.split(tmp)
        if tmpname.lower().endswith(".txt"):
            result.append(tmp)
    return result
def scan_files():
    # the "target" folder is looked up in the current working directory
    return get_recursive_txtfiles(os.path.join(os.getcwd(), "target"))
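def do_onefile(filepath):
    # read all lines first, then overwrite the same file with the processed text
    with open(filepath) as infile:
        linelist = infile.readlines()
    for i in range(len(linelist)):
        linelist[i] = clear_inflection_for_one_paragraph(linelist[i].replace(". ", " . "))
    # join with "\n": text mode on Windows already writes it as "\r\n"
    with open(filepath, "w") as outfile:
        outfile.write("\n".join(linelist))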
def clear_inflection_for_one_paragraph(one_paragraph):
    # tokenize the paragraph, reduce each token, and join the tokens with spaces
    tmp = clear_inflection_for_one_list(nltk.tokenize.word_tokenize(one_paragraph))
    return " ".join(tmp)
def clear_inflection_for_one_list(newlist):
    # replace each token with the shorter of its two candidate base forms
    lemmatizer = nltk.stem.WordNetLemmatizer()
    result = []
    for ele in newlist:
        # lowercase capitalized words (e.g. at the start of a sentence)
        if len(ele) > 1 and ele.istitle():
            ele = ele.lower()
        tmp1 = nltk.corpus.wordnet.morphy(ele)  # may return None
        tmp2 = lemmatizer.lemmatize(ele)  # treats the word as a noun by default
        candidates = [c for c in (tmp1, tmp2) if c is not None]
        # on a tie the morphy() result wins; with no candidate the token is kept
        tmp = min(candidates, key=len) if candidates else ele
        result.append(tmp)
    return result
# The functions above are the helpers
# The following is the main entry point
def start():
    print("This software was developed by Li Liang at GDUFS.")
    print("This script only works on the 'target' folder in the NLTK folder.")
    print("Please copy your corpus folder into the 'target' folder.")
    filelist = scan_files()
    for onefile in filelist:
        do_onefile(onefile)
        print("finished: " + onefile)
    print(str(len(filelist)) + " file(s) processed, please check.")
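
The folder-scanning helpers above (get_files, get_subfolders, get_recursive_subfolders and so on) could also be written with the standard library's os.walk, which visits a folder and all of its sub-folders. This is a minimal sketch, not the author's code. `find_txt_files` is a hypothetical name, and the demo builds a throwaway folder tree in a temporary directory instead of touching a real "target" folder.

```python
import os
import shutil
import tempfile


def find_txt_files(root):
    """Return the paths of all .txt files under root, at any depth."""
    result = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            if name.lower().endswith(".txt"):
                result.append(os.path.join(folder, name))
    return sorted(result)


# Demo on a temporary folder tree (the file names are made up for illustration).
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "a", "b"))
demo_files = [
    "one.txt",
    os.path.join("a", "two.TXT"),
    os.path.join("a", "notes.doc"),
    os.path.join("a", "b", "three.txt"),
]
for relative in demo_files:
    with open(os.path.join(demo_root, relative), "w") as handle:
        handle.write("sample text\n")

found = find_txt_files(demo_root)
for path in found:
    print(os.path.relpath(path, demo_root))

# Remove the temporary tree again.
shutil.rmtree(demo_root)
```

In thinner.py the equivalent call would be find_txt_files(os.path.join(os.getcwd(), "target")).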