Re: Reply: Defining what counts as a word in AntConc (urgent, please help)
AntConc's default setting treats the 62 characters a-zA-Z0-9 as the elements that make up a word, i.e. what is usually called alphanumeric.
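If you also want P.E. to count as a word, you need to add . to the user defined list. But this is a tricky problem: once P.E. counts as a word, sentence-final punctuation will also be counted as part of a word.
So you will have to make that call yourself.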
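To make the trade-off concrete, here is a minimal Python sketch. It is not AntConc's own code; it simply assumes a token is a maximal run of characters from a chosen set, and the `tokenize` helper and the sample sentence are made up for illustration.

```python
import re

def tokenize(text, token_chars):
    # Hypothetical helper: a token is a maximal run of characters taken
    # from token_chars (written as the body of a regex character class).
    return re.findall(f"[{token_chars}]+", text)

text = "He teaches P.E. at school."

# Alphanumeric characters only: "P.E." falls apart into "P" and "E".
default_like = tokenize(text, "a-zA-Z0-9")

# With "." added: "P.E." survives, but the final period sticks to "school".
with_period = tokenize(text, "a-zA-Z0-9.")

print(default_like)
print(with_period)
```

Adding the period keeps the abbreviation together, but the last token becomes `school.` instead of `school`.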
For rigorous processing, it is generally better to tokenize the text before analysis, so that sentence-final punctuation gets split off.
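A pre-tokenization pass of this kind might look roughly like the sketch below. It is only one possible approach: the `ABBREVIATIONS` list and the `pre_tokenize` function are hypothetical, and a real tokenizer would need a far more complete abbreviation list or a smarter rule.

```python
import re

# Hypothetical list of abbreviations whose periods must stay attached.
ABBREVIATIONS = {"P.E.", "U.S.", "e.g.", "i.e."}

def pre_tokenize(text):
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        # Peel any trailing punctuation marks off the chunk.
        match = re.match(r"^(.*?)([.,!?;:]*)$", chunk)
        word, trailing = match.group(1), match.group(2)
        if word:
            tokens.append(word)
        # Each trailing punctuation mark becomes its own token.
        tokens.extend(trailing)
    return tokens

print(" ".join(pre_tokenize("He teaches P.E. at school.")))
```

The space-joined output can then be loaded into the concordancer with "." included in the token definition: P.E. stays whole, and the sentence-final period stands alone as a separate token.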
Have a look at the literature and the software documentation, and try to work out a solution yourself.
Sorry to correct you, but the default token definition is not a-zA-Z0-9. Strictly speaking, it is the Unicode standard "Letter" character class. This means it contains all the 'letters' of the world's alphabets, including a-zA-Z but also Chinese, Korean, and Japanese 'letters' such as 火、单 etc.
However, it does *not* include numbers. So, 0-9 will not be included. Neither will Chinese numbers, such as 一,二,三,四,五 and so on.
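For anyone who wants to experiment with a letter-based definition outside AntConc, here is a rough Python approximation. It is not AntConc's implementation; it assumes "Letter" means the Unicode general categories beginning with L, as reported by Python's standard `unicodedata` module, and the function names are invented for this sketch.

```python
import unicodedata

def is_letter(ch):
    # Unicode general categories Lu, Ll, Lt, Lm and Lo all start with "L".
    return unicodedata.category(ch).startswith("L")

def letter_tokens(text):
    # A token is a maximal run of letter characters; anything else
    # (digits, punctuation, spaces) ends the current token.
    tokens, current = [], []
    for ch in text:
        if is_letter(ch):
            current.append(ch)
        elif current:
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens

print(letter_tokens("P.E. class 火 room 101"))
```

The digits in 101 are category Nd, so they never form a token, while 火 does. One caveat with this particular sketch: Python's `unicodedata` assigns CJK ideographs such as 一 to category Lo, so the sketch counts them as letters; AntConc's own handling may differ.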
This choice means that you know exactly what a token is defined as. Other systems (like WordSmith and MonoConc) seem to define tokens by what they are not: basically, they use delimiters. But this introduces lots of ambiguity when you try to process non-English texts.
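A small sketch of the delimiter approach shows where the ambiguity comes from. This is not how WordSmith or MonoConc are actually implemented; the `DELIMITERS` string and the function name are assumptions chosen only to illustrate the idea.

```python
import re

# Hypothetical delimiter list, written with English text in mind.
DELIMITERS = " \t\n.,;:!?\"'()"

def tokenize_by_delimiters(text, delimiters=DELIMITERS):
    # Anything between runs of delimiter characters counts as a token.
    pattern = "[" + re.escape(delimiters) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]

print(tokenize_by_delimiters("I like tea. 我喜欢茶。Really?"))
```

Because the ideographic full stop 。 is not in the delimiter list, 我喜欢茶。Really comes out as a single token. Every script brings punctuation that the list has to anticipate, whereas a definition based on what a token *is* does not have that problem.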
To match the WordSmith/MonoConc style of token definition, just use the User defined token definition option, and type the characters that you want to include.
I hope that helps.
Laurence.