Following up on our earlier discussion about converting between different POS tagsets (http://www.corpus4u.com/forum_view.asp?forum_id=38&view_id=1819&page=1), here is one solution:
The AMALGAM project
(Automatic Mapping Among Lexico-Grammatical Annotation Models)
Many researchers in Linguistics have been gathering bodies (or corpora) of text that they want to analyse as a way of learning more about languages. It is believed that the more text we have, the more information we can gain. Many research groups have attached labels (or tags) to each word in their corpora, so that, for instance, 'The cat sat on the mat' has the correct grammatical labels attached - 'The' is a determiner, 'cat' is a noun, 'sat' is a verb and so on. However, not all researchers have used the same set of tags, which makes it difficult for these different research groups to work together.
For example, using 'The cat sat on the mat', researchers who use the ICE scheme will produce:
The/ART(def)
cat/N(com,sing)
sat/V(intr,past)
on/PREP(ge)
the/ART(def)
mat/N(com,sing)
./PUNC(per)
Researchers who use the LOB scheme will produce:
The/ATI
cat/NN
sat/VBD
on/IN
the/ATI
mat/NN
./.
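To make the difference between the two outputs easier to see, here is a minimal Python sketch that reads the `word/TAG` lines above and prints the ICE and LOB tags side by side. The function name `parse_tagged` is made up for this illustration, and splitting at the first slash is an assumption that holds for this sentence but not necessarily for every token in a real corpus.

```python
def parse_tagged(lines):
    """Split 'word/TAG' tokens into (word, tag) pairs.

    Assumption: the word itself contains no slash, so the first
    slash separates the word from its tag.
    """
    pairs = []
    for line in lines:
        word, _, tag = line.strip().partition("/")
        pairs.append((word, tag))
    return pairs


# The two annotations of 'The cat sat on the mat.' quoted above.
ice = parse_tagged([
    "The/ART(def)", "cat/N(com,sing)", "sat/V(intr,past)",
    "on/PREP(ge)", "the/ART(def)", "mat/N(com,sing)", "./PUNC(per)",
])
lob = parse_tagged([
    "The/ATI", "cat/NN", "sat/VBD", "on/IN", "the/ATI", "mat/NN", "./.",
])

# Print word, ICE tag and LOB tag in aligned columns.
for (word, ice_tag), (_, lob_tag) in zip(ice, lob):
    print(f"{word:<4} {ice_tag:<14} {lob_tag}")
```

The words are identical in both versions; only the labels differ, which is exactly the gap a tagset mapping has to bridge.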
This is an important problem because there is some evidence that the corpora which we currently have are not large enough for us to produce a general statistical model of grammatical structure. Even though these corpora contain hundreds of thousands, or even millions, of words, that is not enough. We need to collate them into an even larger corpus. This means that we need to find some way of mapping between one set of tags and the others so that we can join them together.
The AMALGAM project is an attempt to create a set of mapping algorithms to map between the main tagsets and phrase structure grammar schemes used in the research corpora described above. We plan to develop a Multi-tagged Corpus and Multi-Treebank, a single text-set annotated with all the above tagging and parsing schemes.
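As a rough illustration of what "mapping between tagsets" means, the sketch below converts the ICE-tagged example sentence into LOB tags with a plain lookup table. This is not AMALGAM's method, which is not described here. The table `ICE_TO_LOB` and the function `map_ice_to_lob` are hypothetical, and the table contains only the five tag pairs that can be read off the example sentence.

```python
# Hypothetical lookup table, built only from the example sentence
# 'The cat sat on the mat.' It is not a complete ICE-to-LOB mapping.
ICE_TO_LOB = {
    "ART(def)": "ATI",
    "N(com,sing)": "NN",
    "V(intr,past)": "VBD",
    "PREP(ge)": "IN",
    "PUNC(per)": ".",
}


def map_ice_to_lob(tagged, table=ICE_TO_LOB, unknown="UNMAPPED"):
    """Replace each ICE tag with its LOB counterpart from the table.

    Tags missing from the table are marked with `unknown` instead of
    being guessed.
    """
    return [(word, table.get(tag, unknown)) for word, tag in tagged]


ice_sentence = [
    ("The", "ART(def)"), ("cat", "N(com,sing)"), ("sat", "V(intr,past)"),
    ("on", "PREP(ge)"), ("the", "ART(def)"), ("mat", "N(com,sing)"),
    (".", "PUNC(per)"),
]

lob_sentence = map_ice_to_lob(ice_sentence)
print(" ".join(f"{word}/{tag}" for word, tag in lob_sentence))
# prints: The/ATI cat/NN sat/VBD on/IN the/ATI mat/NN ./.
```

A one-to-one table like this only works when every source tag has a single counterpart. Even in this small example the ICE tag V(intr,past) records transitivity, which the LOB tag VBD does not, so mapping in the opposite direction would need more than a lookup.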
The following tagsets can be mapped:
Brown Corpus
International Corpus of English (ICE)
London-Lund Corpus (LLC)
Lancaster-Oslo/Bergen Corpus (LOB)
Unix Parts
Polytechnic of Wales Corpus (POW)
Spoken English Corpus (SEC)
University of Pennsylvania Corpus (UPenn)
See more at:
http://www.scs.leeds.ac.uk/amalgam/amalgam/amalghome.htm