NLTK-based batch inflection lemmatizer (基于NLTK的屈折批量还原器.zip)

李亮1975重庆

【Usage Guide】
[1] thinner.py is a Python script based on NLTK; it must be copied into the folder where NLTK is installed.
[2] Inside that NLTK folder you must manually create a folder named "target"; the script only processes the txt files under this "target" folder.
[3] Copy all the files and folders of your corpus into the "target" folder. The NLTK folder does not ship with a "target" folder; you create it yourself.
[4] Once the script and the corpus are in place, double-click python.exe in the NLTK folder (you may not see the ".exe" characters; that is normal). A console window with white text on a black background pops up; type the line below:
import thinner; thinner.start();
then press Enter, and you will see the TXT files processed one by one, including every TXT file in subfolders of subfolders of subfolders...
[5] Because of NLTK itself, the comparative and superlative inflections of adjectives and adverbs are not lemmatized but are left as they are: greater stays greater and greatest stays greatest. (A possible workaround is sketched right after this list.)
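A possible workaround for readers who want it (this is not part of the distributed script): WordNet's lemmatizer can strip regular comparative and superlative endings when it is explicitly told the word is an adjective, which thinner.py never does. A minimal sketch, assuming the standard NLTK WordNet data is installed:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("greater"))            # 'greater': the default POS is noun, so nothing happens
print(wnl.lemmatize("greater", pos="a"))   # 'great': with the adjective POS, the -er rule applies
print(wnl.lemmatize("greatest", pos="a"))  # 'great'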

【Download】On the shared page of my Baidu Netdisk...
http://pan.baidu.com/share/home?uk=724520607&view=share#category/type=0

【Source Code】
Code:
# this software is developed by Li Liang at GDUFS (Guangdong University of Foreign Studies)
# usage: this script must be saved as thinner.py
# step 1: copy this script into the NLTK folder
# step 2: create a folder named "target" there, and copy your corpus folder into "target"
# step 3: double-click python.exe, enter "import thinner; thinner.start();", and press ENTER
# That is all: just the three steps above.
# Comparative and superlative forms of adjectives and adverbs are left unchanged,
# because NLTK's default lemmatization does not cover them.
import os
import nltk

def get_files(newpaths):
    # newpaths is a list; collect the files sitting directly inside one or more folders
    result = []
    for newpath in newpaths:
        for tmp in os.listdir(newpath):
            if os.path.isfile(newpath + "/" + tmp):
                result.append(newpath + "/" + tmp)
    return result

def get_subfolders(newpaths):
    # newpaths is a list; collect the direct subfolders of one or more folders
    result = []
    for newpath in newpaths:
        for tmp in os.listdir(newpath):
            if os.path.isdir(newpath + "/" + tmp):
                result.append(newpath + "/" + tmp)
    return result

def get_recursive_subfolders(newpath):
    # newpath is a string; descend level by level until no deeper folders are found
    result = []
    mypath = [newpath]
    while True:
        tmp = get_subfolders(mypath)
        result = result + tmp
        if len(tmp) == 0:
            break
        mypath = tmp
    return result

def get_recursive_files(newpath):
    # all files under the folder, at every depth
    result = get_files([newpath])
    result = result + get_files(get_recursive_subfolders(newpath))
    return result

def get_recursive_txtfiles(newpath):
    # keep only the files whose names end in .txt (case-insensitive)
    result = []
    for tmp in get_recursive_files(newpath):
        tmppath, tmpname = os.path.split(tmp)
        if tmpname.lower().endswith(".txt"):
            result.append(tmp)
    return result
def scan_files():
    # the script only looks inside the "target" folder next to it
    return get_recursive_txtfiles(os.path.join(os.getcwd(), "target"))

def do_onefile(filepath):
    # read every line, lemmatize it, and write the result back in place
    with open(filepath) as f:
        linelist = f.readlines()
    for i in range(len(linelist)):
        # strip the trailing newline (readlines keeps it) and pad sentence-final
        # periods with spaces so the tokenizer separates them from the last word
        line = linelist[i].rstrip("\r\n").replace(". ", " . ")
        linelist[i] = clear_inflection_for_one_paragraph(line)
    with open(filepath, "w") as f:
        f.write("\n".join(linelist))

def clear_inflection_for_one_paragraph(one_paragraph):
    tmp = clear_inflection_for_one_list(nltk.tokenize.word_tokenize(one_paragraph))
    return " ".join(tmp)

def clear_inflection_for_one_list(newlist):
    result = []
    lemmatizer = nltk.stem.WordNetLemmatizer()  # build once, not once per token
    for ele in newlist:
        # lower-case capitalized words so sentence-initial tokens are recognized
        if len(ele) > 1 and ele.istitle():
            ele = ele.lower()
        tmp1 = nltk.corpus.wordnet.morphy(ele)
        tmp2 = lemmatizer.lemmatize(ele)
        # prefer the shorter of the two candidate lemmas; fall back to the token itself
        if tmp1 is None and tmp2 is None:
            tmp = ele
        elif tmp2 is None:
            tmp = tmp1
        elif tmp1 is None:
            tmp = tmp2
        elif len(tmp1) <= len(tmp2):
            tmp = tmp1
        else:
            tmp = tmp2
        result.append(tmp)
    return result
# The functions above are the building blocks; start() below is the main entry.
def start():
    print("this software is developed by Li Liang at GDUFS.")
    print("This script only processes the 'target' folder inside the NLTK folder.")
    print("Please copy your corpus folder into the 'target' folder.")
    filelist = scan_files()
    for onefile in filelist:
        do_onefile(onefile)
        print("finished: " + onefile)
    print(str(len(filelist)) + " file(s) processed, please check.")
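
Before letting the script rewrite a whole corpus, it is easy to sanity-check the paragraph-level function directly in the interpreter. An illustrative session (the exact output depends on the WordNet data installed with NLTK):

>>> import thinner
>>> thinner.clear_inflection_for_one_paragraph("The cats were eating apples .")
'the cat be eating apple .'

Note that "eating" survives untouched because it happens to match a WordNet noun; this is the -ing behaviour discussed in the replies below.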
 

Attachments

  • 基于NLTK的屈折批量还原器.zip
    1.9 KB · Views: 20
Re: 基于NLTK的屈折批量还原器.zip

This is really handy. Is there any way to lemmatize comparatives and superlatives, though?
 
Re: 基于NLTK的屈折批量还原器.zip

I've found that many -ing forms used as non-finite verbs or gerunds are not lemmatized, and neither are comparatives and superlatives. Is there a way to solve this? Dr. Li Liang's online TreeTagger-based lemmatizer is slower than this NLTK approach, and it often reports a dropped connection.
 
Re: 基于NLTK的屈折批量还原器.zip

I've also noticed that this method mistakes some words ending in -s for plural nouns and strips the s: for example, it turns as and was into a and wa. It is also blind to some irregular past participles, such as felt, which it does not lemmatize. For everyone's reference.
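
The over-stripping reported above is reproducible, and it follows from the script calling the lemmatizer with its default noun part of speech: the noun rules strip a final -s and accept the remainder whenever it happens to be listed in WordNet ("a" is there as the letter, "wa" as an abbreviation for Washington). Likewise "felt" survives because it matches the WordNet noun for the fabric, so the verb exception list (felt, feel) is never consulted. A quick check, assuming standard WordNet data:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize("as"))    # 'a': the noun rule strips the -s, and 'a' is a WordNet noun
print(wnl.lemmatize("was"))   # 'wa': same rule; 'wa' matches a WordNet abbreviation
print(wnl.lemmatize("felt"))  # 'felt': matches the noun for the fabric, returned unchanged
print(wnl.lemmatize("was", pos="v"))  # 'be': with the verb POS, the exception list applies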
 
Comparatives and superlatives can be lemmatized, but doing it with high accuracy requires programming.

By default, NLTK's lemmatizer works by rule, that is, by the surface spelling of the word. The advantage of this approach is that it is extremely fast; the drawback is that its accuracy is not very high, so some errors are inevitable. The alternative is a lookup (database) approach, which is more complicated to program but much more accurate.

Lemmatizing comparatives and superlatives means writing your own code on top of NLTK. I have no time to build free software for this at the moment; I may start on it in the second half of the year.
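
For readers who want to try that programming now, one common recipe (a sketch under standard assumptions, not the author's code) is to tag the tokens with nltk.pos_tag, map the Penn Treebank tags onto WordNet's four categories, and pass that category to the lemmatizer. This handles regular comparatives and superlatives (JJR/JJS, RBR/RBS), sends -ing forms tagged as verbs through the verb rules, and leaves words like "as" untouched, since prepositions map to no WordNet category. It needs the "punkt" and "averaged_perceptron_tagger" NLTK data packages:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# Map the first letter of a Penn Treebank tag to a WordNet POS constant.
TAG_TO_POS = {"J": wordnet.ADJ, "V": wordnet.VERB,
              "N": wordnet.NOUN, "R": wordnet.ADV}

def lemmatize_with_pos(text):
    """Lemmatize a sentence, letting POS tags pick the WordNet category."""
    wnl = WordNetLemmatizer()
    out = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(text)):
        pos = TAG_TO_POS.get(tag[0])
        # Only open-class words get lemmatized; 'as', 'the' etc. pass through.
        out.append(wnl.lemmatize(word, pos) if pos else word)
    return " ".join(out)

print(lemmatize_with_pos("the greatest cats were running"))
# expected: 'the great cat be run' (exact output depends on the tagger model)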
 
Re: 基于NLTK的屈折批量还原器.zip

Does this kind of lemmatization work for German? I would like to lemmatize German txt files. Thanks for any reply, and thanks again.
 