Methodological Issues Regarding Corpus-based Analyses of Linguistic Variation
DOUGLAS BIBER
Northern Arizona University, USA
Correspondence: D. Biber, Department of English, Northern Arizona University, Flagstaff, AZ 86011-6032, USA.
Although corpus-based analyses of linguistic variation have provided fresh insights into previously intractable issues, several methodological criticisms have been raised about the overall design of text corpora and the validity of text ‘genres’ as a basis for analyses of variation. Unfortunately, most of these criticisms have been based on intuitive judgements rather than empirical investigation. The present study begins to supply the missing evidence. It focuses on four particular methodological issues: (1) how long texts should be in order to reliably represent the distribution of linguistic features in particular text categories; (2) how many texts within each text category are required in order to reliably represent the linguistic characteristics of that category, and related questions concerning the validity of ‘genre’ categories; (3) how many texts are needed in a corpus to accurately identify the salient parameters of linguistic variation among texts; and (4) how much of a cross-section is required to identify and analyze the salient parameters of variation among texts. These issues are addressed through statistical investigation of the distribution of linguistic features across various sub-samples of the LOB and London-Lund corpora, compared with their distribution across the full corpora. The results indicate that existing corpora are adequate for many analyses of linguistic variation. In conclusion, the paper welcomes the future availability of larger and more representative corpora, but it also urges researchers to fully exploit existing corpora for ongoing investigations of linguistic variation.