Help: What are the requirements for the reference corpus?

I want to study the keywords in SWECCL, but I have not been able to find an appropriate reference corpus. Besides being larger than the study corpus, what requirements should a reference corpus meet? Are there any requirements for its contents? Thanks!
 
Re: Help: What are the requirements for the reference corpus?

You may want to read these two articles on what the requirements for a reference corpus should be:

Article One: In Search of a Bad Reference Corpus

by Mike Scott

Abstract
What are the tolerable limits of similarity between a reference corpus and a node text for the generation of a useful set of keywords? There is of course considerable subjectivity in the notion of usefulness, which will vary according to research goals which cannot in general be predicted with certainty. Nevertheless, the aim here is to explore the ways in which the similarity between reference corpus and node text varies on certain important dimensions, such as size in tokens, similarity of text-type, similarity of historical period, similarity of subject-matter.

This paper starts from the formula proposed by Berber Sardinha (2004: 101-3) which suggests that the larger the reference corpus, the more keywords will be detected, and his formula for predicting the number of keywords produced with a given text and reference corpus. It also considers his recommendation that a reference corpus should be about five times the size of the node text.

Using a series of reference corpora, the paper explores keywords results in relation to specific texts. The aim is to identify not, as one might imagine, the characteristics of the good reference corpus, but the limits defining a poor one, since in many cases, e.g. the analysis of a dead language or a restricted corpus, the chance of accessing a good reference corpus is slim. The study represents work in progress and much further work needs to be done to confirm and develop its preliminary findings.

Download: http://www.methodsnetwork.ac.uk/redist/pdf/es1_05scott.pdf


Article Two: Word frequency and keyword extraction: rapporteur's report

by Marilyn Deegan

Read it at:
http://www.methodsnetwork.ac.uk/activities/es01rapporteur.html
 
Re: Help: What are the requirements for the reference corpus?

These two papers are of great use.
 
Re: Help: What are the requirements for the reference corpus?

Here is another one:

Comparing corpora with WordSmith Tools: How large must the reference corpus be?

Tony BERBER-SARDINHA
LAEL, Catholic University of Sao Paulo

Abstract

WordSmith Tools (Scott, 1998) offers a program for comparing corpora, known as KeyWords. KeyWords compares a word list extracted from what has been called 'the study corpus' (the corpus which the researcher is interested in describing) with a word list made from a reference corpus. The only requirement for a word list to be accepted as reference corpus by the software is that it must be larger than the study corpus. One of the most pressing questions with respect to using KeyWords seems to be what would be the ideal size of a reference corpus. The aim of this paper is thus to propose answers to this question. Five English corpora were compared to reference corpora of various sizes (varying from two to 100 times larger than the study corpus). The results indicate that a reference corpus that is five times as large as the study corpus yielded a larger number of keywords than a smaller reference corpus. Corpora larger than five times the size of the study corpus yielded similar amounts of keywords. The implication is that a larger reference corpus is not always better than a smaller one, for WordSmith Tools Keywords analysis, while a reference corpus that is less than five times the size of the study corpus may not be reliable. There seems to be no need for using extremely large reference corpora, given that the number of keywords yielded does not seem to change by using corpora larger than five times the size of the study corpus.
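To make the comparison described in the abstract more concrete, here is a rough sketch of a keyword comparison between a study corpus and a reference corpus. It is not WordSmith's own code. It assumes the log-likelihood statistic as the keyness measure and a cut-off of 3.84 (the chi-square critical value for p < 0.05 with one degree of freedom), and the function names `log_likelihood`, `reference_large_enough` and `keywords` are made up for this illustration. The `min_ratio=5` default simply mirrors the five-times recommendation in the abstract.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness for one word.

    a: frequency in the study corpus, b: frequency in the reference corpus,
    c: size of the study corpus in tokens, d: size of the reference corpus.
    """
    # Expected frequencies if the word were equally common in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def reference_large_enough(n_study, n_ref, min_ratio=5):
    """Check the rule of thumb from the abstract (assumed default: 5 times)."""
    return n_ref >= min_ratio * n_study


def keywords(study_tokens, ref_tokens, threshold=3.84):
    """Return (word, keyness) pairs for positive keywords, highest first."""
    study = Counter(study_tokens)
    ref = Counter(ref_tokens)
    n_study, n_ref = len(study_tokens), len(ref_tokens)
    results = []
    for word, a in study.items():
        b = ref.get(word, 0)
        ll = log_likelihood(a, b, n_study, n_ref)
        # Keep only words that are relatively MORE frequent in the study corpus.
        if ll >= threshold and a / n_study > b / n_ref:
            results.append((word, ll))
    return sorted(results, key=lambda pair: -pair[1])


# Toy data, invented purely for illustration.
study_tokens = ["corpus"] * 10 + ["the"] * 10
ref_tokens = ["the"] * 50 + ["of"] * 50
print(reference_large_enough(len(study_tokens), len(ref_tokens)))
print(keywords(study_tokens, ref_tokens))
```

In this toy example "the" has the same relative frequency in both lists, so it is not key, while "corpus" never appears in the reference list and comes out as a keyword. With real data you would tokenize SWECCL and the reference corpus first and pass in the two token lists.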

You can download this paper and others in my online storage at:

http://corpuslaohong.ys168.com/
Password: corpus4u
 
Re: Help: What are the requirements for the reference corpus?

I tested it a minute ago, and it is still working well. It shows that the account has been logged into 32 times today. Please try again later.
 