About AntConc's word definition (urgent, please help)

Fellow members, I urgently need help. In AntConc, when defining words for lemmatization, how do I set the Token (Word) Definition options so that, for example, case is ignored and an abbreviation such as P.E. counts as a single word?
 
Re: About AntConc's word definition (urgent, please help)

By default, AntConc takes the 62 characters a-zA-Z0-9 as the elements of its word definition, i.e. what is usually called alphanumeric characters.

If you want P.E. to count as a word, you have to add the period (.) to the user-defined list as well, but this is tricky: once P.E. counts as one word, sentence-final punctuation will also be treated as part of the preceding word.
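To see the trade-off concretely, here is a minimal Python sketch (my own illustration of the behaviour described, not AntConc's internals):

    import re

    text = "He teaches P.E. at the local school."

    # Letters only: the period splits tokens, so P.E. falls apart.
    print(re.findall(r"[A-Za-z]+", text))
    # ['He', 'teaches', 'P', 'E', 'at', 'the', 'local', 'school']

    # Letters plus '.': P.E. survives, but so does the sentence-final period.
    print(re.findall(r"[A-Za-z.]+", text))
    # ['He', 'teaches', 'P.E.', 'at', 'the', 'local', 'school.']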

So you will have to make that trade-off yourself.

For rigorous work, it is usually better to run a tokenization pass before the analysis; that way, sentence-final punctuation gets split off.
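Such a pass might look like this minimal Python sketch (the abbreviation list and regex are my own assumptions; a real tokenizer handles many more cases):

    import re

    # Abbreviations to protect; this list is a made-up example.
    ABBREVIATIONS = {"P.E.", "e.g.", "i.e.", "etc."}

    def tokenize(text):
        tokens = []
        for chunk in text.split():
            if chunk in ABBREVIATIONS:
                tokens.append(chunk)  # keep the abbreviation as one token
            else:
                # separate runs of word characters from runs of punctuation
                tokens.extend(re.findall(r"\w+|[^\w\s]+", chunk))
        return tokens

    print(tokenize("Students enjoy P.E. even in winter."))
    # ['Students', 'enjoy', 'P.E.', 'even', 'in', 'winter', '.']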

Have a look at the literature and the software documentation, and try to work out a solution for yourself.
 
Re: About AntConc's word definition (urgent, please help)

Yes, I keep running into problems like these during processing. So how do I ignore case and have abbreviations count as single words?
 
Re: Re: About AntConc's word definition (urgent, please help)


Sorry to correct you, but the default token definition is not a-zA-Z0-9. Strictly speaking, it is the Unicode standard "Letter" character class. This means it contains all the 'letters' of the world's alphabets, including a-zA-Z but also Chinese, Korean, and Japanese 'letters' such as 火 and 单.

However, it does *not* include numbers, so 0-9 will not be included. Neither will Chinese numerals such as 一, 二, 三, 四, 五, and so on.
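A rough Python approximation of that inclusion-based definition (a sketch built on the standard unicodedata module, not AntConc's actual code):

    import unicodedata

    def is_letter(ch):
        # True for the Unicode general categories Lu, Ll, Lt, Lm, Lo
        return unicodedata.category(ch).startswith("L")

    def letter_tokens(text):
        tokens, current = [], []
        for ch in text:
            if is_letter(ch):
                current.append(ch)
            elif current:
                tokens.append("".join(current))
                current = []
        if current:
            tokens.append("".join(current))
        return tokens

    print(letter_tokens("fire 火 2024 word9"))
    # ['fire', '火', 'word']

One wrinkle: Unicode assigns Han numerals such as 一 the Letter (Lo) category, so this naive check would keep them; per the note above, AntConc excludes them as well.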

This choice means that you know exactly what a token is defined as. Other systems (like WordSmith and MonoConc) seem to define tokens by what they are *not*: basically, they use delimiters. But this introduces a lot of ambiguity when you try to process non-English texts.
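The difference is easy to demonstrate with a sketch of the delimiter approach (the delimiter list here is my own toy choice; real tools ship longer lists):

    # A token is any maximal run of non-delimiter characters.
    DELIMITERS = set(" \t\n.,;:!?\"'()")

    def delimiter_tokens(text):
        tokens, current = [], []
        for ch in text:
            if ch in DELIMITERS:
                if current:
                    tokens.append("".join(current))
                    current = []
            else:
                current.append(ch)
        if current:
            tokens.append("".join(current))
        return tokens

    print(delimiter_tokens("A simple English sentence."))
    # ['A', 'simple', 'English', 'sentence']

    # The full-width 。 is not in the delimiter list, so it sticks to the token:
    print(delimiter_tokens("这是一个例子。"))
    # ['这是一个例子。']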

To match the WordSmith/MonoConc style of token definition, just use the user-defined token definition option and type in the characters that you want to include.
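As a sketch of what that amounts to (the character set below is a hypothetical example, and I am interpreting the user's entry as a regex character class):

    import re

    # Suppose the user includes letters, digits, apostrophe, and hyphen.
    user_chars = "A-Za-z0-9'-"
    token_pattern = re.compile(f"[{user_chars}]+")

    print(token_pattern.findall("It's a rock-solid, user-defined list from 2007."))
    # ["It's", 'a', 'rock-solid', 'user-defined', 'list', 'from', '2007']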

I hope that helps.
Laurence.
 
Re: About AntConc's word definition (urgent, please help)

Yes, in AntConc a word is defined as a string of alphabetic letters. Two years ago I wrote my own Perl script using that definition to build word lists, and got the same results as AntConc ;)
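For what it's worth, a minimal word-list counter along those lines might look like this in Python (my own reconstruction, not the original Perl script):

    import re
    import sys
    from collections import Counter

    def word_list(text):
        # [^\W\d_] matches Unicode letters only (word chars minus digits and
        # underscore); lower-casing first makes the count case-insensitive.
        return Counter(re.findall(r"[^\W\d_]+", text.lower()))

    if __name__ == "__main__":
        text = open(sys.argv[1], encoding="utf-8").read()
        for word, freq in word_list(text).most_common():
            print(f"{freq}\t{word}")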
 