[original] A perl concordancer

dzhigner

Moderator
I built this concordancer two years ago when I was crazy about Perl. This is the source code.

use warnings;
use strict;
use IO::File;
use File::Find;
use Getopt::Std;
use vars qw(
$usage
$SearchPattern
$filnm
$context
%kwicc
@kwic
$opt_l
$opt_r
$opt_c
$opt_m
$opt_t
$i);

$usage=<<EOF;
usage: skwic [ -lrct ][-m NUM] <pattern> <path>
-l: Sort by left context
-r: Sort by right context
-c: Sort by pattern
-m: Line limitation
-t: Display collocations
(In this version if you use option -t, search pattern
is restricted to a single word.)
EOF

die $usage unless scalar(@ARGV)>1;
$SearchPattern = shift;
CheckReMsg($SearchPattern);
$filnm=shift;
getopts( "lrctm:" );
if ($opt_t && $SearchPattern=~ /[^\w\d'-\\]/){die "In this version if you use option -t, search pattern is restricted to a single word.";}
if(($opt_l && $opt_r)||($opt_l && $opt_c)||($opt_c && $opt_r))
{die "argument error.";}
$context=40;
$i=0;

sub CheckReMsg{
my $pattern = shift;
die "Illegal pattern: $@ \n" unless my $re = compile_re($pattern);
}

sub compile_re{
my $pattern = shift;
my $re;
eval {$re = qr/$pattern/;};
return $re;
}

sub parsewL {
if ($opt_m){
if ($i==$opt_m){goto L;}}
my $rawstr=shift;
my @wordsb=();
while ($rawstr=~/\b[\w\d'-]+\b/gi){
push @wordsb, $&."_$i";
}
$i++;
return $wordsb[$#wordsb];
}
sub parsewR {
if ($opt_m){
if ($i==$opt_m){goto L;}}
my $rawstr=shift;
my @wordsb=();
while ($rawstr=~/\b[\w\d'-]+\b/gi){
push @wordsb, $&."_$i";}
$i++;
return $wordsb[0];
}

sub domatch
{
my $myfh=shift;
my $name=shift;
warn "$name is a directory\n" and return if -d $name;
local $/=undef;
while(<$myfh>)
{
$_ =~ s/\n/ /g;
while($_=~/$SearchPattern/g){
my $prepttn="";
my $postpttn="";
if(length($`)<=$context){$prepttn=$`;}
elsif(length($`)>$context){$prepttn=substr($`,-$context);}
if (length($')<=$context){$postpttn=$';}
elsif(length($')>$context){$postpttn=substr($',0,$context);}
no warnings;
if ($opt_l){$kwicc{parsewL($prepttn)}=" "x($context-length($prepttn)).$prepttn.$&.$postpttn."\n";
}
elsif ($opt_r){$kwicc{parsewR($postpttn)}=" "x($context-length($prepttn)).$prepttn.$&.$postpttn."\n";
}
elsif ($opt_c) {$kwicc{$&."_$i"}=" "x($context-length($prepttn)).$prepttn.$&.$postpttn."\n";
$i++;
if ($i==$opt_m){goto L;}
}

else {print " "x($context-length($prepttn)).$prepttn.$&.$postpttn."\n" ;
push @kwic," "x($context-length($prepttn)).$prepttn.$&.$postpttn."\n";
$i++;
if ($i==$opt_m){exit;}
}
}
}
close $myfh;
}

if ( -d $filnm )
{
find(
sub {
return unless -T;
my $fh = IO::File->new( $_ );
domatch $fh, $File::Find::name;
},
$filnm
);
}
else
{
use warnings;
open(my $fh, $filnm) or die "Unable to open: $! \n";
domatch $fh, $filnm;
}

L:
if ($opt_l || $opt_r || $opt_c){
foreach(sort keys %kwicc){
push @kwic, $kwicc{$_};
print $kwicc{$_};

}
}



if ($opt_t){collocate($SearchPattern, @kwic);}
sub collocate {
no warnings;
my ($keyword, @kwicfile)=@_;
my @token=();
my %count=();
my @gset;
foreach (@kwicfile){
$_ =~ s/\s+/ /g;
while (/(\b[\w\d'-]+\b)\s(\b[\w\d'-]+\b\s){0}\b$keyword\b/gi){
my $buff=$1;
$buff=~tr/A-Z/a-z/;
$count{"$buff..."}++;}


while (/\b$keyword\b(\s\b[\w\d'-]+\b){0}\s(\b[\w\d'-]+\b)/gi){
my $buff=$2;
$buff=~tr/A-Z/a-z/;
$count{"...$buff"}++;}


}
print"\n--------------------------\n";

while ((my $key, my $value) = each %count) {
push @gset,$value."_"."$key\n";
};
@gset=sort {int($a)<=>int($b)or $a cmp $b} @gset;
print reverse @gset;
}

http://www.corpus4u.org/upload/forum/2005080405494198.rar
 

动态语法

管理员
Staff member
回复:[original] A perl concordancer

Thanks for sharing the script, dzhigner.

I have tried it under the DOS window (see picture below).




However, I am more interested in using it from within another
Windows program, I wonder if you could modify it so that

1) it could be called upon from another running program (I'm using NoteTab,
which has this function built in);

2) Perl will excute the script to work on the current open text file; and

3) the search item is the one that is currently being highlighted in the open text.

Sorry if this is asking too much.

For your information, here is the codes in NoteTab to run a Perl script
within the library (not an external script as yours):


H=";Scripts"


H="Perl script"
;Check if a Perl interpreter has been setup. If not, jump to the label NoPerl
^!IfFileExist ^$GetPerlExe$ Next ELSE NoPerl
;Delete the file "StdError.err" from the NoteTab program directory, if it exists
^!DeleteFile ^$GetAppPath$StdError.err
;Execute the Perl interpreter with the script contained in the Clip labeled "NumLines"
^!RunPerl Perl NumLines
;If the interpreter generated error messages, load them from the file "StdError.err"
^!IfFileExist ^$GetAppPath$StdError.err Next ELSE End
^!Open ^$GetAppPath$StdError.err
^!Prompt Perl has reported some errors!
^!GoTo End
:NoPerl
^!Prompt Cannot execute this Clip without a Perl interpreter!


H="_Perl NumLines"
#
#?\perl\numline.pl
#
# begin
while (<>) {
$x=sprintf("%05d",$.);
print "$x $_";
}
# end
 

dzhigner

Moderator
Sorry,现在恐怕做不到了,好久没用过Perl了,生疏了。现在只能用 VB.NET 来做这种工具。关于Concordaner我有一个问题,比如像您贴的图片所显示的那样,或多或少有上下文重复,这是不是一个问题?我用其他一些concordancer检索较短的文本做过试验,很多也是这样的,

但是如果把concordancing作为统计分析的材料,重复的上下文是不是影响分析结果的精确性?我认为,应该确保把重复限制在“小文本”之外,我一直在想如何解决这个问题,不过还没找到好的办法。
 

xiaoz

永远的超级管理员
Staff member
回复:[original] A perl concordancer

Repetition of this kind is not a problem, as what concordancers should do is to extract ALL instances of a search term.


以下是引用 dzhigner2005-8-4 17:08:32 的发言:
Sorry,现在恐怕做不到了,好久没用过Perl了,生疏了。现在只能用 VB.NET 来做这种工具。关于Concordaner我有一个问题,比如像您贴的图片所显示的那样,或多或少有上下文重复,这是不是一个问题?我用其他一些concordancer检索较短的文本做过试验,很多也是这样的,

但是如果把concordancing作为统计分析的材料,重复的上下文是不是影响分析结果的精确性?我认为,应该确保把重复限制在“小文本”之外,我一直在想如何解决这个问题,不过还没找到好的办法。
 

动态语法

管理员
Staff member
回复:[original] A perl concordancer

Right, as xiaoz said, this is not really a problem for concordancing as
concordancing is a token based exercise: each and every instance of
a token use is supposed to be displayed. It would be a problem if you
have this on other occasions, such as in word lists, where types are
what matters.


[本贴已被 作者 于 2005年08月04日 23时40分43秒 编辑过]
 

动态语法

管理员
Staff member
回复:[original] A perl concordancer

以下是引用 dzhigner2005-8-4 17:08:32 的发言:
Sorry,现在恐怕做不到了,好久没用过Perl了,生疏了。现在只能用 VB.NET 来做这种工具。
没关系。不是什么重要的事。
 
顶部