Is a preset linguistic motivation for collecting speech data justifiable?
The second problem concerns whether the sampling of the target speech data should be theoretically motivated. The compilation of a special-purpose corpus is usually directed at a specific research objective, because it is neither economical nor practical to make a small corpus all-inclusive.
The speech data from fieldwork will ultimately be shaped not only by the language itself but also by the research goals we aim to achieve. For instance, in the situated adolescent spoken corpus, we want to investigate discourse markers from a prosodic perspective. We therefore need to record more casual talk rather than formal speech or sociolinguistic interviews. If the focus is on the language of urban adolescent speakers, sampling is confined to this particular population.
Some would argue that it is myopic to limit the record to data pertinent to issues of current theoretical interest, but we have to keep the quantity of recordings in check. We cannot hope to anticipate all future needs (Mithun 2001:53); theory gives us much guidance on methodological issues and helps us find finer things to look at. This problem again points to our discussion of the relationship between data and theory. It is not appropriate to say that we set up a theoretical framework and make natural data fit it; rather, it is economical in actual field research to give data collection a general theoretical orientation.
Linguistics benefits when fieldworkers do more than merely gather data for a theoretician to interpret (Everett forthcoming). We understand Everett to mean that linguistic theory shapes our corpus planning and narrows our categories of samples.
By linguistic motivation, we generally mean the question of which genre or register of discourse should be given priority, given the funding and energy available. In the Corpus of Situated Adolescent Speech, for example, if the object of investigation is the phonetic and/or phonological aspects of discourse, we need to find less noisy settings so as to obtain higher-quality audio recordings.
In a sense, the identity of a corpus is shaped before it actually comes into being. A corpus is by its very nature a purpose-built linguistic databank.