Re: Help: What are the requirements for the reference corpus?
You may want to read these two articles, both of which deal with what the requirements for a reference corpus should be:
Article One: In Search of a Bad Reference Corpus
by Mike Scott
Abstract
What are the tolerable limits of similarity between a reference corpus and a node text for the generation of a useful set of keywords? There is of course considerable subjectivity in the notion of usefulness, which will vary according to research goals which cannot in general be predicted with certainty. Nevertheless, the aim here is to explore the ways in which the similarity between reference corpus and node text varies on certain important dimensions, such as size in tokens, similarity of text-type, similarity of historical period, similarity of subject-matter.
This paper starts from the formula proposed by Berber Sardinha (2004: 101-3) which suggests that the larger the reference corpus, the more keywords will be detected, and his formula for predicting the number of keywords produced with a given text and reference corpus. It also considers his recommendation that a reference corpus should be about five times the size of the node text.
Using a series of reference corpora, the paper explores keywords results in relation to specific texts. The aim is to identify not, as one might imagine, the characteristics of the good reference corpus, but the limits defining a poor one, since in many cases, e.g. the analysis of a dead language or a restricted corpus, the chance of accessing a good reference corpus is slim. The study represents work in progress and much further work needs to be done to confirm and develop its preliminary findings.
Download:
http://www.methodsnetwork.ac.uk/redist/pdf/es1_05scott.pdf
Article Two: Word frequency and keyword extraction: rapporteur's report
by Marilyn Deegan
Read it at:
http://www.methodsnetwork.ac.uk/activities/es01rapporteur.html
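
In case it helps to see the mechanics behind Scott's abstract, here is a minimal sketch of how a keyword list can be generated by comparing a node text with a reference corpus. It is only an illustration under stated assumptions: the abstract does not say which keyness statistic is used, so log-likelihood (G2) is my choice here, and the function names, the 6.63 cut-off (roughly p < 0.01 at one degree of freedom) and the min_freq setting are all hypothetical, not taken from either article.

```python
import math
from collections import Counter

def log_likelihood(a, b, c, d):
    """G2 keyness score for one word.

    a: frequency in the node text,        c: node text size in tokens
    b: frequency in the reference corpus, d: reference corpus size in tokens
    """
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2

def keywords(node_tokens, ref_tokens, threshold=6.63, min_freq=2):
    """Return (word, score) pairs for positive keywords, highest score first.

    threshold and min_freq are illustrative assumptions, not values
    recommended by the articles.
    """
    node = Counter(node_tokens)
    ref = Counter(ref_tokens)
    c, d = len(node_tokens), len(ref_tokens)
    found = []
    for word, a in node.items():
        if a < min_freq:
            continue
        b = ref.get(word, 0)
        # Keep only words that are relatively MORE frequent in the node text.
        if a / c <= b / d:
            continue
        score = log_likelihood(a, b, c, d)
        if score >= threshold:
            found.append((word, score))
    return sorted(found, key=lambda pair: pair[1], reverse=True)

# Toy data: "corpus" is unusually frequent in the node text.
node_text = ["corpus"] * 20 + ["the"] * 80
reference = ["the"] * 1000 + ["corpus"] * 5
print(keywords(node_text, reference))
```

To get a feel for the question the abstract raises, you could rerun keywords() with the same node text against reference corpora of different sizes or text types and compare how the resulting lists change.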