Here is the program I used to detag the BNC before retagging it with the C7 tagset. It removes everything other than the original texts and transcripts. You will need to install Perl, which is free, in order to use the program. Then follow the steps below:
1) Make a new directory on your machine;
2) Copy (do not move) the selected BNC files into that directory;
3) Unzip the Perl script into the same directory;
4) Double-click the program file.
A new file will be created for each BNC file, ending in .txt. These new files are what you want.
Warning: This program only works with BNC files.
http://www.corpus4u.org/upload/forum/2005122923225827.zip
Perl script written by xiaoz.
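For anyone curious what the steps above amount to, here is a rough sketch in Python. It is not xiaoz's Perl script and may well differ from it. It assumes that all markup sits between "<" and ">", that the input files have no extension (as with original BNC names like fb4), and that they can be read as UTF-8. The names detag and detag_directory are made up for this illustration. It only strips the tags themselves: text inside a header element is kept, and SGML entities are left untouched.

```python
import re
import tempfile
from pathlib import Path

# Assumption: markup is SGML/XML-style, i.e. anything between "<" and ">".
TAG = re.compile(r"<[^>]*>")


def detag(text):
    """Remove angle-bracket markup and tidy the whitespace left behind."""
    no_tags = TAG.sub("", text)
    lines = (" ".join(line.split()) for line in no_tags.splitlines())
    return "\n".join(line for line in lines if line)


def detag_directory(folder):
    """Write a detagged <name>.txt next to every extensionless file in folder."""
    written = []
    # sorted() takes a snapshot of the listing before any new file is created.
    for source in sorted(folder.iterdir()):
        if source.is_file() and source.suffix == "":
            target = source.with_name(source.name + ".txt")
            cleaned = detag(source.read_text(encoding="utf-8"))
            target.write_text(cleaned, encoding="utf-8")
            written.append(target)
    return written


if __name__ == "__main__":
    # Demonstrate on a throwaway directory with one invented sample file.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "fb4"
        sample.write_text('<s n="1"><w NN1>Hello <w NN1>world<c PUN>.</s>\n',
                          encoding="utf-8")
        for path in detag_directory(Path(tmp)):
            print(path.name, "->", path.read_text(encoding="utf-8"))
```

The output name always differs from the input name here, so the source files are left untouched. Working on copies, as in step 2, is still the safer habit.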
ActivePerl has to be installed, but installing it is all you need to do. It is a big download, though.
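I have installed ActivePerl, but as soon as I ran BNCdetag, every file in my folder was wiped. Only empty shells are left in the folder, and every file now shows a size of 0 bytes. I'm stunned! What is the cause? (Attached is a portable version of a Perl editor.)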
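I'm not sure whether this meets your requirements. Criticism is welcome. I'm away from home at the moment; otherwise I could adjust my own macro and upload it for you to use.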
Work out exactly what content needs to be removed, then adjust the tool accordingly.
I'm sorry this has happened; it is most unfortunate. It happened because the original filenames of the BNC files had been changed: original BNC filenames look like fb4, with no extension, rather than fb4.txt.
To process the BNC files with names like fb4.txt, you can modify lines 4 and 15 of the script as follows:
Line 4:
@files=grep (/\b[A-Z0-9]{3}\.txt\b/i, readdir (DIR));
Line 15:
$output="new_".$fn;
Then you will get resulting files with names like new_fb4.txt, which are what you want.
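To see what the two modified lines do without touching any real files, here is a small Python sketch of the same file selection and output naming. The pattern is the one from line 4 carried over to Python's re module; the helper names select_inputs and output_name and the sample file listing are invented for illustration.

```python
import re

# Python equivalent of the Perl pattern /\b[A-Z0-9]{3}\.txt\b/i from line 4.
BNC_TXT_NAME = re.compile(r"\b[A-Z0-9]{3}\.txt\b", re.IGNORECASE)


def select_inputs(names):
    """Keep only names that look like renamed BNC files, e.g. fb4.txt."""
    return [name for name in names if BNC_TXT_NAME.search(name)]


def output_name(name):
    """Mirror $output="new_".$fn so that output and input names differ."""
    return "new_" + name


if __name__ == "__main__":
    # Invented directory listing, standing in for readdir(DIR).
    listing = ["fb4.txt", "A00.txt", "fb4", "notes.txt", "BNCdetag.pl"]
    for name in select_inputs(listing):
        print(name, "->", output_name(name))
```

Only fb4.txt and A00.txt are selected from this listing. A name like new_fb4.txt does not match the pattern, because the underscore counts as a word character and so there is no word boundary before "fb4"; a second run would therefore skip the output files.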
Nicely done; plain text like that is all I need. But how exactly is it done, and how do I adjust the macro and so on? Please advise!
Sorry, I still don't get it. There is more than one file, and each file has more than one line. How can I modify so many files and so many lines by hand? Is there a more convenient way to transform the format once and for all?
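That should be easy to solve. Could you post a sample as well so we can diagnose it?
Sorry, I have the same problem, though with ICE-HK rather than BNC. I want to remove all of its headings and tags. I tried the detagger tool from the forum: it works on a single document but not in batch mode, and if I first merge the files with the document organizer and then detag, it hangs and never responds again. I have tried several times and I'm at my wits' end...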
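One way around the hang might be to skip the merging step and clean each file separately, one line at a time. Below is a rough Python sketch along those lines. It is not the forum's detagger tool, and it has not been tried on ICE-HK. It assumes that all markup is written inside angle brackets, that no tag spans a line break, and that the files end in .txt. The names strip_file and strip_folder, the folder names, and the sample text are invented. It drops only the tags themselves, so text enclosed between a pair of heading tags is kept; removing that would need a corpus-specific rule.

```python
import re
import tempfile
from pathlib import Path

# Assumption: headings and tags are all written inside angle brackets.
MARKUP = re.compile(r"<[^>]*>")


def strip_file(source, target):
    """Copy source to target line by line, dropping markup and empty lines."""
    with open(source, encoding="utf-8", errors="replace") as src, \
         open(target, "w", encoding="utf-8") as dst:
        for line in src:
            # Replace each tag with a space so neighbouring words stay apart,
            # then collapse runs of whitespace.
            cleaned = " ".join(MARKUP.sub(" ", line).split())
            if cleaned:
                dst.write(cleaned + "\n")


def strip_folder(in_dir, out_dir):
    """Process every .txt file in in_dir separately; return the file count."""
    if in_dir.resolve() == out_dir.resolve():
        # Writing into the input folder would truncate the files being read.
        raise ValueError("output folder must differ from input folder")
    out_dir.mkdir(parents=True, exist_ok=True)
    count = 0
    for source in sorted(in_dir.glob("*.txt")):
        strip_file(source, out_dir / source.name)
        count += 1
    return count


if __name__ == "__main__":
    # Demonstrate on a throwaway folder with one invented sample file.
    with tempfile.TemporaryDirectory() as tmp:
        raw, clean = Path(tmp) / "raw", Path(tmp) / "clean"
        raw.mkdir()
        (raw / "sample.txt").write_text(
            "<I>\n<$A> <#> Good morning <,> everyone\n</I>\n", encoding="utf-8")
        print(strip_folder(raw, clean), "file(s) written")
        print((clean / "sample.txt").read_text(encoding="utf-8"), end="")
```

Because the cleaned files go into a separate folder, the originals are never overwritten, and memory use stays small however many files there are.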
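I would be grateful for guidance from all the teachers here. I have read the posts above carefully, but I am hopeless with computers, and many of the programs and terms mentioned above leave me completely lost. Please show me a basic method.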
I am waiting eagerly. Many thanks in advance.