Re: A question about calculating keyness
Thank you, volfer, for your patient guidance. I understand the approach you describe, where a larger reference corpus is used to identify the topics of a smaller corpus. I just found this passage on keyness in P. Baker's Using Corpora in Discourse Analysis (p. 125): "Using WordSmith, it is possible to compare the frequencies in one wordlist against another in order to determine which words occur statistically more often in wordlist A when compared with wordlist B and vice versa. Then all of the words that do occur more often than expected in one file when compared to another are compiled together into another list, called a keyword list."
He then compares the two subcorpora, anti-hunting and pro-hunting, produces a table (with keyness and p values), and says: "The first part shows words which occur more frequently in the anti-hunt speeches when compared to the pro-hunt speeches, while the opposite is true for the second part of the list."
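Earlier in the book Baker does mention the reference corpus, in the same terms you used. So does keyness have two levels: one derived by comparing a subcorpus with a larger reference corpus, the other by comparing two subcorpora with each other? What I have in mind is the latter, a comparison between two subcorpora. So is it still workable to look at the words with unusual frequency in text A relative to text B as a way of showing how the topics involved differ? I hope volfer can keep helping me sort this out. Thanks!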
A very interesting discussion.
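I just looked through the book and found the passage you quoted. First, what Baker says there is indeed the basic definition of keywords: two word lists are compared to see which words in A have unusually high frequency relative to B (he calls these positive keywords) and which have unusually low frequency (he calls these negative keywords).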
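To make the comparison of two word lists concrete, here is a minimal sketch in Python. It assumes the log-likelihood statistic, one of the tests commonly used for keyness; it is not a reproduction of WordSmith's own routine. The function names, the toy word counts, and the 3.84 cutoff (the chi-square critical value for p < 0.05 at one degree of freedom) are all illustrative assumptions.

```python
import math
from collections import Counter


def log_likelihood(a, b, n_a, n_b):
    """Log-likelihood (G2) for a word seen a times in corpus A (n_a tokens)
    and b times in corpus B (n_b tokens)."""
    # Expected counts if the word were equally frequent in both corpora.
    e_a = n_a * (a + b) / (n_a + n_b)
    e_b = n_b * (a + b) / (n_a + n_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e_a)
    if b > 0:
        g2 += b * math.log(b / e_b)
    return 2 * g2


def keywords(freq_a, freq_b, threshold=3.84):
    """Return (word, 'positive' | 'negative', G2) for words whose frequency
    in A is unusual relative to B, sorted by G2 in descending order."""
    n_a = sum(freq_a.values())
    n_b = sum(freq_b.values())
    results = []
    for word in set(freq_a) | set(freq_b):
        a = freq_a.get(word, 0)
        b = freq_b.get(word, 0)
        g2 = log_likelihood(a, b, n_a, n_b)
        if g2 >= threshold:
            # Positive: relatively more frequent in A; negative: less frequent.
            sign = "positive" if a / n_a > b / n_b else "negative"
            results.append((word, sign, g2))
    return sorted(results, key=lambda r: r[2], reverse=True)


# Toy word lists (invented counts, for illustration only).
freq_a = Counter({"the": 500, "ban": 50, "cruel": 30})
freq_b = Counter({"the": 500, "ban": 5, "freedom": 40})

for word, sign, g2 in keywords(freq_a, freq_b):
    print(f"{word}\t{sign}\t{g2:.2f}")
```

In this toy example "the" is about equally frequent in both lists and drops out, while "ban" and "cruel" come out as positive keywords of A and "freedom" as a negative one. Swapping the two arguments flips the signs, which is the "vice versa" in Baker's description.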
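It is worth pointing out that Baker's definition of keywords is fully consistent with the one Scott gives across his papers (Baker simply does not stress here that keywords must be computed from a small text under study against a large reference corpus), so there is no question of two levels.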
Scott defines the reference corpus as "an appropriate sample of the language which the text we are studying (the "node text") is written in. An "appropriate" sample usually means a large one, preferably many thousands of words long and possibly much more." (Scott & Tribble 2006: 58)
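On how the reference corpus affects which keywords are identified, you could read "Key words of individual texts: aboutness and style" (Scott & Tribble), collected in Textual patterns: key words and corpus analysis in language education (2006, John Benjamins). There they discuss in detail how the keywords extracted from the play Romeo and Juliet differ when different reference corpora are chosen.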
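You may have noticed that when Baker compares the pro-hunting and anti-hunting texts, he does not call the latter the reference corpus of the former. He does bring up a reference corpus later (p. 137): he chooses FLOB, analyzes its linguistic characteristics, judges them to be broadly consistent with those of the hunting debate, and notes that it is well over five times the size of the text under study. So on the definition of a reference corpus, too, Baker is fully consistent with Scott.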
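So what did Baker do in the first half of his study? We need to look at the characteristics of texts A and B. First, they are two parts (two subcorpora) extracted from the same debate (the same larger corpus); they are roughly equal in size (71,468 vs. 58,330); and the split between them was made by Baker himself. He assumes that "there are two sides to the debate, and that by comparing one side against another we are likely to find a list of keywords which will then act as signposts to the underlying discourses within the debate on fox-hunting." But he also notes that "Not all texts consist of so clear-cut positions. For example, a corpus of newspaper articles on the same subject might be best considered as an undifferentiated mass, unless the articles came from two or more different newspapers or were written at different times." (p. 137) In such a case we would be forcing ourselves to focus on differences between parts of the same corpus, when what deserves more attention is the characteristics of the corpus as a whole (ibid.). From that point on he uses FLOB as a reference corpus for further analysis.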
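To sum up, my view is that you can perhaps compare your self-built texts A and B, but several conditions must be met: they come from different time periods (this is already satisfied), they must be on the same topic, and they must be roughly equal in size. Even then, we may still overlook some features (keywords) that A and B share. These are all things you need to weigh carefully in your analysis.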
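The risk of overlooking shared keywords can be seen in a small numeric sketch. It again assumes the log-likelihood statistic. The two subcorpus sizes are the ones from Baker's debate data; the word counts and the one-million-token reference size are invented purely for illustration, and the variable names are hypothetical.

```python
import math


def log_likelihood(a, b, n_a, n_b):
    """Log-likelihood (G2) for a word seen a times in n_a tokens
    versus b times in n_b tokens."""
    e_a = n_a * (a + b) / (n_a + n_b)
    e_b = n_b * (a + b) / (n_a + n_b)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e_a)
    if b > 0:
        g2 += b * math.log(b / e_b)
    return 2 * g2


# Subcorpus sizes as given for the two sides of the debate.
n_a, n_b = 71_468, 58_330
# Invented counts for a word that both sides use heavily.
count_a, count_b = 300, 250
# Hypothetical general reference corpus where the same word is rare.
n_ref, count_ref = 1_000_000, 50

# A against B: the shared word is about equally frequent, so it is not key.
a_vs_b = log_likelihood(count_a, count_b, n_a, n_b)

# A and B together against the reference corpus: the same word stands out.
combined_vs_ref = log_likelihood(count_a + count_b, count_ref, n_a + n_b, n_ref)

print(f"A vs B:           G2 = {a_vs_b:.2f}")
print(f"A+B vs reference: G2 = {combined_vs_ref:.2f}")
```

With these made-up numbers the word falls far below the 3.84 cutoff in the A against B comparison but far above it against the reference corpus, which is why a direct A versus B comparison is best read alongside a comparison with a larger reference corpus.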