Re: Open corpus platform
I hope you will still say that after you have tried it.
For the platform, we put forward the following arguments:
1) What texts should be included in a corpus?
Any texts that meet the demands of the user. No single corpus can meet the demands of all users. A researcher with a specific analytical goal will have a particular set of texts in mind for his or her own use. That being so, any available corpus may be skewed in one way or another. One idea nagging at almost every corpus user would be: why not build one of my own? Or, as a compromise, can I construct a subcorpus from a larger, miscellaneous collection of texts?
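To make the subcorpus idea concrete, here is a minimal sketch in Python. It assumes every text in the larger collection carries a few metadata fields; the `Text` class, its field names and the `build_subcorpus` helper are hypothetical and only for illustration, not part of any existing platform.

```python
from dataclasses import dataclass


@dataclass
class Text:
    # Hypothetical metadata fields; a real collection may record others.
    text_id: str
    genre: str
    year: int
    body: str


def build_subcorpus(collection, **criteria):
    """Return the texts whose metadata match every given criterion."""
    return [
        text for text in collection
        if all(getattr(text, field) == value for field, value in criteria.items())
    ]


# A tiny, made-up collection standing in for a large miscellaneous one.
collection = [
    Text("t1", "news", 2007, "Markets fell sharply on Monday."),
    Text("t2", "academic", 2007, "This paper examines corpus design."),
    Text("t3", "academic", 2008, "We report a study of learner errors."),
]

academic = build_subcorpus(collection, genre="academic")
academic_2008 = build_subcorpus(collection, genre="academic", year=2008)
print([text.text_id for text in academic])
print([text.text_id for text in academic_2008])
```

The point is only that the selection criteria belong to the user, so the same collection can yield a different subcorpus for each research question.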
2) To what extent is the markup scheme of a corpus adequate, so that finer queries can always be made on the basis of text classifications?
In fact, no markup scheme can do everything. An individual researcher will only feel comfortable with his or her own scheme, in which the information needed is always properly recorded in the annotation. Discourse markers or error tags that are inserted manually in the texts also need computer support, so that they can be retrieved at a later stage. Some corpora are heavily annotated, too heavily for an ordinary user, while others are scarcely annotated at all.
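As a rough illustration of retrieving manually inserted tags at a later stage, the sketch below assumes a made-up inline format, `<err type="...">...</err>`. The tag format, the type labels and the `find_error_tags` helper are assumptions for the example, not an existing scheme.

```python
import re

# Assumed inline format for a manually inserted error tag (hypothetical).
ERR_TAG = re.compile(r'<err type="(\w+)">(.*?)</err>')


def find_error_tags(text_id, body):
    """Return (text_id, error_type, tagged_span) for each tag found in body."""
    return [
        (text_id, match.group(1), match.group(2))
        for match in ERR_TAG.finditer(body)
    ]


essay = (
    'I <err type="sp">recieve</err> many letters '
    'and <err type="agr">he go</err> home.'
)
hits = find_error_tags("essay01", essay)
for text_id, error_type, span in hits:
    print(text_id, error_type, span)
```

Whatever scheme a researcher prefers, the requirement is the same: what was typed in by hand must be machine-readable enough to be pulled out again.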
3) If one constructs one's own corpus, how can the representativeness of the sampled texts be secured?
The issue of the representativeness of corpus texts is somewhat overemphasized, as Teubert (2008) argued. I couldn't agree more. In the case of a big corpus for general purposes, it is usually important to maintain a good balance in text selection, to make sure that the samples from different genres are representative of the texts at large. But for a collection of texts of a specified genre, how many texts are selected is often more important than which texts are included. I would even venture to question whether we actually need a corpus for general purposes. Whenever we do research, we do it for specific purposes on well-defined sets of texts. Is there such a thing as research for general purposes, after all? So our suggestion would be that one does not have to worry too much about representativeness. If one has to worry about something, think of where to look for the texts and how many to collect.
4) Technology: does one have to be a software engineer to be a corpus researcher?
Definitely not. Wolfgang once joked that being too technologically sophisticated could diminish your reputation as a linguist. Why not have something that lets one forget all about the technology and concentrate on the text analysis?
5) About KWIC analysis: is it already good enough?
It is by its nature an analysis of specified features within a small area, the mini text. But the mini text is not an integral whole; it falls apart easily once we try to trace each concordance line back to its source, since each line is drawn from a text of a different kind. When we examine a line, we need to know exactly what kind of text it comes from, and with what annotation.
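One way to keep each concordance line tied to its source is to carry the text's identifier and classification along with the line. The sketch below is only an illustration under simple assumptions: texts are given as `(text_id, genre, body)` tuples, tokenization is a crude regular expression, and the `kwic` function is hypothetical rather than taken from any existing tool.

```python
import re


def kwic(texts, keyword, width=3):
    """Return concordance lines as (text_id, genre, left, node, right).

    texts: iterable of (text_id, genre, body) tuples.
    width: number of context tokens kept on each side of the node word.
    """
    keyword = keyword.lower()
    lines = []
    for text_id, genre, body in texts:
        # Crude tokenization for the example: lowercase word characters only.
        tokens = re.findall(r"\w+", body.lower())
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((text_id, genre, left, token, right))
    return lines


sample = [
    ("t1", "news", "The corpus was built from newspaper articles."),
    ("t2", "essay", "A learner corpus records student writing."),
]
concordance = kwic(sample, "corpus", width=2)
for text_id, genre, left, node, right in concordance:
    print(f"{text_id} [{genre}] {left:>20} | {node} | {right}")
```

With the source and its classification attached to every line, the reader can sort or filter the concordance by text type instead of treating all lines as if they came from one homogeneous text.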