Open corpus platform

wzli

Regular member
Hi Pals,
We are developing an open corpus platform that lets any user build his or her own annotated corpus for data retrieval and KWIC analysis. We are planning to put it through a stress test, and you are welcome to join us.
Cheers,
WZ
 
Re: Open corpus platform

Is this a Christmas gift to us? But can you tell us a bit more about how to join?
 
Re: Open corpus platform

We are currently preparing the help files. Once they are done, we will provide some trial user IDs and passwords so that those interested can try their hand. Ocorp is designed as a workbench on which users design their own markup scheme and build their own collection of texts.
I am sorry to say that it is not actually a gift, but a series of highly experimental ideas put on trial. I'll let you know as soon as it is ready and accessible.
 
Re: Open corpus platform

Great news, I'm looking forward to trying the platform.

Thanks, Dr. Li.
 
Re: Open corpus platform

Great news indeed.

How should individual users contribute to your project? Will there be a web-based KWIC interface?
 
Re: Open corpus platform

I hope you will still say that after you have tried it.

For the platform, we put forward the following arguments:

1) What texts should be involved in a corpus?
Any texts that meet the demands of the user. No single corpus can meet the demands of all users. A researcher with a specific goal for analysis must have a particular set of texts in mind, and for that researcher any available corpus may be skewed in one way or another. One idea ringing in almost every corpus user's mind would be: why not build one of my own? Or, as a compromise, can I construct a subcorpus from a larger, miscellaneous collection of texts?
2) To what extent is the markup scheme of a corpus adequate, so that finer queries can always be made on the basis of text classifications?
Actually, no markup scheme can do everything. An individual researcher feels comfortable only with his or her own scheme, in which the information needed is always properly recorded as annotation. Discourse markers or error tags that are manually inserted into the texts also need to be supported by the computer and retrievable at a later stage. Some corpora are heavily annotated, too heavily for an ordinary user, while others are scarcely annotated.
3) If one constructs his or her own corpus, how can the representativeness of the sampled texts be secured?
The issue of the representativeness of corpus texts is a little overemphasized, as Teubert (2008) argued. I couldn't agree more. In the case of a big corpus for general purposes, it is usually important to maintain a good balance in text selection, to make sure that the samples from different genres are representative of the texts at large. But for a collection of texts of a specified genre, how many texts are selected is often more important than what texts are included. I would even venture to ask whether we actually need a corpus for general purposes. Whenever we do research, we do it for specific purposes on well-defined sets of texts. Is there such a thing as research for general purposes, after all? So our suggestion would be that one does not have to worry too much about representativeness. If one has to worry about something, think of where to look for the texts and how many to collect.
4) Technology: does one have to be a software engineer to be a corpus researcher?
Definitely not. Wolfgang once joked that being too technologically sophisticated could diminish your reputation as a linguist. Why not have something that lets one forget all about the technology and concentrate on the text analysis?
5) About KWIC analysis: is it already good enough?
It is by nature an analysis of specified features within a small area, the mini text. And the mini text is not an integral whole: it falls apart easily when we try to locate each concordance line, since each line is called forth from a different kind of text. When we examine a line, we need to know exactly what kind of text it comes from, and with what annotation.
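
To make points 1) and 5) concrete, here is a minimal sketch, not how Ocorp actually works. It assumes each text is stored together with a small dictionary of user-defined annotations; the names `Text` and `kwic` and the `genre` field are hypothetical and chosen only for illustration. Every concordance line carries the name and annotations of the text it came from, and a subcorpus is just a selection of texts by annotation.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Text:
    """One corpus text plus the user's own annotations (hypothetical schema)."""
    name: str
    body: str
    annotations: dict = field(default_factory=dict)


def kwic(texts, keyword, width=3):
    """Return concordance lines that keep the annotations of their source text."""
    lines = []
    for text in texts:
        # Crude tokenisation: lowercase word characters only.
        tokens = re.findall(r"\w+", text.body.lower())
        for i, token in enumerate(tokens):
            if token == keyword.lower():
                lines.append({
                    "left": " ".join(tokens[max(0, i - width):i]),
                    "node": token,
                    "right": " ".join(tokens[i + 1:i + 1 + width]),
                    "source": text.name,
                    "annotations": text.annotations,
                })
    return lines


corpus = [
    Text("essay01", "The corpus is small but the corpus is clean.",
         {"genre": "learner essay"}),
    Text("news01", "A new corpus was released today.", {"genre": "news"}),
]

# Subcorpus: keep only the texts whose annotation fits the research question.
subcorpus = [t for t in corpus if t.annotations.get("genre") == "news"]

for line in kwic(corpus, "corpus"):
    print(f'{line["left"]:>20} [{line["node"]}] {line["right"]:<20} '
          f'{line["source"]} {line["annotations"]}')
```

Running `kwic(subcorpus, "corpus")` instead restricts the concordance to the selected texts, so the reader of each line always knows what kind of text it belongs to.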
 
Re: Open corpus platform

Is this an initiative? Has the platform been fully set up? I think this is a great idea, and I have been thinking about something similar myself. I think an efficient protocol design for input data is crucial. Data format, processing level, and many other factors should be taken into account.
 
Re: Open corpus platform

I think an open corpus platform is really a brilliant idea and an effective instrument for all kinds of users, including professional corpus researchers and many others who can benefit from corpus data and representative texts, such as language teachers, students, writers, and translators.

Personally, I am mainly interested in a large-scale, general-purpose native speaker/writer corpus, which can greatly benefit all kinds of users. This kind of corpus platform has to be large in scale and needs a great number of contributors, which raises the issue of the markup scheme. I think there must be one, but the solution is a multi-level markup scheme, implemented with a user-friendly input interface if possible.

We have not said much here about text-processing issues, but many aspects of corpus design and construction are constrained by current text-mining techniques, such as encoding, the format of logical lines, regular expressions, and so on. Overlooking such technical details can add to the difficulty of designing such a large-scale system. One doesn't have to be a software engineer to use this platform, but planning and implementing it requires experts who know both corpus research and software engineering, at least to some degree.

Personally, I think it is impossible to design a perfect corpus platform. It is important to decide which tasks computers handle best and which to leave to human judgment. And where human judgment is vital in corpus analysis, the KWIC interface should provide sorting and filtering functions.
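
As a rough illustration of what such sorting and filtering could look like, here is a minimal sketch. The concordance lines are made-up sample data and the helper names (`first_right`, `last_left`, `filter_lines`) are hypothetical, not part of any existing platform. The computer only reorders and narrows the lines; deciding what the patterns mean is left to the analyst.

```python
# Made-up concordance lines; in practice they would come from a KWIC search.
lines = [
    {"left": "a new", "node": "corpus", "right": "was released", "genre": "news"},
    {"left": "the", "node": "corpus", "right": "is small", "genre": "learner essay"},
    {"left": "small but the", "node": "corpus", "right": "is clean", "genre": "learner essay"},
]


def first_right(line):
    """Sort key: the first word to the right of the node word."""
    words = line["right"].split()
    return words[0] if words else ""


def last_left(line):
    """Sort key: the word immediately to the left of the node word."""
    words = line["left"].split()
    return words[-1] if words else ""


def filter_lines(lines, **criteria):
    """Keep only the lines whose fields match every given criterion."""
    return [l for l in lines if all(l.get(k) == v for k, v in criteria.items())]


by_right = sorted(lines, key=first_right)   # groups "is ..." before "was ..."
by_left = sorted(lines, key=last_left)      # groups "new" before "the"
essays = filter_lines(lines, genre="learner essay")

for line in by_right:
    print(f'{line["left"]:>15} [{line["node"]}] {line["right"]:<15} {line["genre"]}')
```

Because `sorted` is stable, lines with the same sort key keep their original order, so a second sort on another key can be layered on top of the first.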
 