李亮1975重庆
Political Commissar, Corpus Happy Army
JavaScript-based Tagged-Word Cleaner
The download is on the share page of my Baidu Netdisk:
http://pan.baidu.com/share/home?uk=7...ategory/type=0
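The archive attached to this post contains a single HTM web page, which is effectively a small program: double-click it and the interface opens. The interface is just 1 text box and 4 buttons. Paste the text you want to process into the text box, click the button you need, and the processing is done.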
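The JavaScript-based "Tagged-Word Cleaner" strips out the tagged words. Once they are removed, only the annotation tags are left in the text. To count the tags and study how they combine, search the cleaned text directly in AntConc. Of course, if the tags are not ordinary words but contain special symbols, you will also need to adjust the "Token Definition" item in AntConc's settings accordingly.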
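For a quick sanity check before loading the cleaned text into AntConc, the tag counts can also be tallied in a few lines of JavaScript. This is only a rough sketch: `countTags` is a hypothetical helper that is not part of the attached page, and it assumes the tags are separated by whitespace and contain no whitespace themselves.

```javascript
// Hypothetical helper (not in the attached page): tally the tags that
// remain after cleaning. Assumes tags are whitespace-separated tokens.
function countTags(cleanedText) {
  const counts = new Map();
  for (const tag of cleanedText.split(/\s+/)) {
    if (tag === "") continue; // skip empty pieces from leading/trailing whitespace
    counts.set(tag, (counts.get(tag) || 0) + 1);
  }
  // Return [tag, count] pairs, most frequent first.
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}

console.log(countTags(" DT NN \n VBD  DT NN"));
```

This only gives raw frequencies. For concordances and tag combinations, AntConc is still the tool to use.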
The cleaner supports 4 tag formats: underscore, slash, backslash, and square bracket.
<!DOCTYPE html><html><head>
<script>
function clean_underlined() {
var str=document.getElementById("textbox1").value;
// Pad every newline with spaces (g flag) so the \s below eats a space, not the line break.
var tmp=str.replace(/\n/g," \n ");
tmp=" "+tmp;
var output=tmp.replace(/\s\w{1,}_/gm," ");
document.getElementById("textbox1").value=output;
}
function clean_slashed() {
var str=document.getElementById("textbox1").value;
// Pad every newline with spaces (g flag) so the \s below eats a space, not the line break.
var tmp=str.replace(/\n/g," \n ");
tmp=" "+tmp;
var output=tmp.replace(/\s\w{1,}\//gm," ");
document.getElementById("textbox1").value=output;
}
function clean_backslashed() {
var str=document.getElementById("textbox1").value;
// Pad every newline with spaces (g flag) so the \s below eats a space, not the line break.
var tmp=str.replace(/\n/g," \n ");
tmp=" "+tmp;
var output=tmp.replace(/\s\w{1,}\\/gm," ");
document.getElementById("textbox1").value=output;
}
function clean_squarebracketed() {
var str=document.getElementById("textbox1").value;
// Pad every newline with spaces (g flag) so the \s below eats a space, not the line break.
var tmp=str.replace(/\n/g," \n ");
tmp=" "+tmp;
var output=tmp.replace(/\s\w{1,}\[/gm," [");
document.getElementById("textbox1").value=output;
}
</script>
</head><body>
<div style="font-size:35px;">Tagged-Word Cleaner (made by 李亮)</div>
<textarea id="textbox1" cols="50" rows="10"></textarea><br /><br />
<input type="button" value="clean the underline-tagged words" onclick="clean_underlined()" /><br /><br />
<input type="button" value="clean the slash-tagged words" onclick="clean_slashed()" /><br /><br />
<input type="button" value="clean the backslash-tagged words" onclick="clean_backslashed()" /><br /><br />
<input type="button" value="clean the square-bracket-tagged words" onclick="clean_squarebracketed()" /><br /><br />
</body></html>
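The four button handlers differ only in the delimiter, so the same idea can be sketched as one function. This is an illustrative rewrite, not the code in the attachment: `cleanTagged` and the `DELIMITERS` table are hypothetical names, and like the page above it assumes the tagged words consist only of `\w` characters (ASCII letters, digits, underscore).

```javascript
// Hypothetical consolidation of the four handlers (names are not from the page).
// Each delimiter is written as regex source, escaped where needed.
const DELIMITERS = {
  underscore: "_",
  slash: "\\/",
  backslash: "\\\\",
  bracket: "\\[",
};

function cleanTagged(text, kind) {
  const delim = DELIMITERS[kind];
  if (delim === undefined) throw new Error("unknown tag format: " + kind);
  // Pad every newline with spaces so the \s in the pattern consumes
  // a space rather than the line break itself.
  const padded = " " + text.replace(/\n/g, " \n ");
  // For square brackets the "[" belongs to the tag, so it is put back.
  const keep = kind === "bracket" ? " [" : " ";
  return padded.replace(new RegExp("\\s\\w+" + delim, "g"), keep);
}

console.log(JSON.stringify(cleanTagged("the_DT cat_NN\nsat_VBD", "underscore")));
// prints " DT NN \n VBD"
```

Each button could then call something like `cleanTagged(box.value, "slash")`. One thing to keep in mind: because `\w` also matches the underscore, a tag that itself contains an underscore gets partly eaten in underscore mode.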