Future Directions in Corpus Linguistics
- Tony McEnery/ Patricia Shaw
Video Summary
The following summary provides an overview of the discussion at the CASTA 2005 Forum on Corpus Linguistics, October 3rd, 2005. Video clips of the event are linked throughout the text. Where possible we have also provided links to PowerPoint materials and other references.
Introduction
Tony McEnery (Lancaster University) was introduced by John Newman, Chair of Linguistics at the University of Alberta. McEnery has worked as a corpus linguist for many years. He is known for his Mandarin Chinese project, the results of which are available online. His most recently published book is on corpus-based language studies.
Lecture
McEnery began his talk by emphasizing that a discussion of corpus linguistics would have been on a different scale 20 years ago. He gave numerous examples of earlier experiences in corpus linguistics in the first section of his lecture, called "Yesterday". Collecting data was more difficult than it is now; by the mid-eighties a one-million-word corpus (video clip 1) was a great accomplishment.
The data at that time was used for simple purposes, and there was no standardization of corpus markup schemes. The data was not easily accessible. Corpus linguists' tools were primitive; they were written in various programming languages and ran on different platforms. Automated annotation was available only for English. The tools were also limited to describing the basic grammatical structure of the language. The results were less impressive for their research insight than as evidence that this kind of work could be done at all (video clip 2).
Finding data was extremely difficult. Electronic representations of textual data were not easy to obtain. Researchers had to access the data manually or work with whatever they could get their hands on. In those days there were only a few corpus linguists, and they all knew each other. Corpus linguistics had limited dialogue with mainstream linguistics (video clip 3). The absence of standards (video clip 4) was a further limiting factor.
McEnery also talked about the difficulties of becoming a corpus linguist and how much it demanded in terms of professional background and training (video clip 5).
In the second part of his lecture, called "Yesterday's Tomorrow", McEnery discussed the problems that corpus linguistics expected to solve, and how matters turned out quite differently once the field set about solving them. He gave examples from his own work of how data and theory interact, and of how linguistic theories must address the substantial evidence of real language use that corpora represent (video clip 6).
Corpus annotation is still not widely accessible, and corpus data is available for only a few languages. English-language corpus building might present a good model for other languages (video clip 7).
Even carefully collected corpora risk embedding artificial language situations, because written texts often pass through a process of selection and editing (video clip 8).
In his third section, called "Today", McEnery outlined some of the persistent problems in today's corpus linguistics. On a positive note, data for many of the world's languages are now more accessible. Some problems have disappeared on their own, while some new ones have been introduced by computer software (video clip 9).
Finally, under the heading of "Tomorrow", McEnery suggested that persistent corpus problems should be dealt with, and that software development must accelerate. With the arrival of XML and Unicode, the payoff of encoded corpus language data is clearer than ever. Finding a standardized format for working with all the languages of the world should be a research priority. Linguistic theory that is not corpus-based or corpus-referenced should be challenged constantly (video clip 10).
Response
Patricia Shaw, from the University of British Columbia, acted as our invited respondent. She shared her professional experience (video clip 11) with the world of corpus linguistics and talked about how the field is vital to efforts to study the many spoken First Nations dialects for which we no longer have native speakers (video clip 12).
As a representative of the First Nations, Shaw believes that the documentation of these unique languages is extremely important, and must be based in the communities where the language is used (video clip 13).
Shaw ended by emphasizing that corpora and corpus-linguistic approaches are also essential to teaching the language (video clip 14).
NOTE:
1. The video clips will be uploaded to corpus4u's gmail account later;
2. The PowerPoint presentation used in conjunction with this talk is available here:
http://forum.corpus4u.org/upload/forum/2006030714340942.pdf