I have put a zip file on my website (http://mcs.open.ac.uk/dh5368/). It
contains a list of inflection-lemma mappings, a list of lemma-inflection
mappings, and a file called singles.txt, which contains forms in the
lexicon that could not be reduced.
The data was extracted from the CUVPlus lexicon by running a lemmatising
algorithm to reduce every entry in the lexicon and checking the
resulting proposed lemmas against the lexicon.
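
To illustrate the idea, here is a minimal sketch of that propose-and-check step. It is not the algorithm used to build the data: the RULES table, the function names and the toy entries are all hypothetical, and the only assumption is that each lexicon entry is a (form, C7 tag) pair.

```python
# Hypothetical suffix rules keyed by C7 tag: (suffix, replacement) pairs.
# These are toy rules for illustration, not the rules behind the real data.
RULES = {
    "NN2": [("ies", "y"), ("es", ""), ("s", "")],   # plural common noun
    "VVZ": [("ies", "y"), ("es", ""), ("s", "")],   # -s form of lexical verb
    "VVD": [("ed", ""), ("d", "")],                 # past tense
    "VVG": [("ing", ""), ("ing", "e")],             # -ing participle
    "RR": [("ly", "")],                             # general adverb
}


def propose_lemmas(form, tag):
    """Return candidate base forms produced by the rules for this tag."""
    candidates = []
    for suffix, replacement in RULES.get(tag, []):
        if form.endswith(suffix) and len(form) > len(suffix):
            candidates.append(form[: -len(suffix)] + replacement)
    return candidates


def reduce_lexicon(entries):
    """Split (form, tag) entries into corroborated mappings and singles."""
    entries = list(entries)
    known_forms = {form for form, _ in entries}
    mappings, singles = {}, []
    for form, tag in entries:
        if tag not in RULES:
            # The tag does not mark an inflected form: treat as base form.
            mappings[(form, tag)] = form
            continue
        # Keep only proposals that the lexicon itself corroborates.
        confirmed = [c for c in propose_lemmas(form, tag) if c in known_forms]
        if confirmed:
            mappings[(form, tag)] = confirmed[0]
        else:
            singles.append((form, tag))
    return mappings, singles


toy_lexicon = [
    ("cat", "NN1"), ("cats", "NN2"), ("quick", "JJ"), ("quickly", "RR"),
    ("only", "RR"), ("scissors", "NN2"), ("make", "VV0"), ("making", "VVG"),
]
mappings, singles = reduce_lexicon(toy_lexicon)
```

In this toy run "only" and "scissors" end up in singles, because the tag says they should be reducible but no proposed base form is in the lexicon.
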
The file lemmas.txt contains inflection-lemma mappings that were
corroborated by the lexicon and inflect.txt contains the inverse
mappings. These files include words that are already in base form.
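
As a rough sketch of how the two tables relate, the inverse mapping can be derived from the forward one as below. The function name and sample data are hypothetical, and nothing is assumed here about the on-disk layout of lemmas.txt or inflect.txt.

```python
from collections import defaultdict


def invert_mappings(inflection_to_lemma):
    """Build a lemma -> sorted list of inflections table from the forward table."""
    lemma_to_inflections = defaultdict(list)
    for inflection, lemma in inflection_to_lemma.items():
        lemma_to_inflections[lemma].append(inflection)
    return {lemma: sorted(forms) for lemma, forms in lemma_to_inflections.items()}


# Toy forward table; base forms map to themselves, as in the real files.
forward = {"make": "make", "makes": "make", "making": "make", "cat": "cat", "cats": "cat"}
inverse = invert_mappings(forward)
```
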
The singles.txt file contains word forms that, judging by the tag, should
be reducible but for which no proposed lemma could be found in the
lexicon. Most are adverbs that have no adjective base form, and many are
non-count plural forms. There are also some (BNC) tagging errors,
misspellings and rare word forms. I have included the BNC frequency for
each entry from the lexicon, as most of the noise is of low frequency.
Please note that, because the data was extracted from the CUVPlus
lexicon, words it does not cover do not appear in the mappings.
All the entries in the files are tagged using the C7 tagset.
The data is a work in progress, but I believe it is fairly clean.
If you decide to use the mapping tables, please cite my PhD thesis, which
is at Birkbeck College, University of London, and due for submission
later this year.
Thank you,
Dave
----
Download link: http://mcs.open.ac.uk/dh5368/lemmas.zip