[ZT and Download] English lemma list

xiaoz

Permanent super administrator
Staff member
I have put a zip file on my website (http://mcs.open.ac.uk/dh5368/). It
contains a list of inflection-lemma mappings, lemma-inflection mappings,
and a file called singles.txt, which contains forms in the lexicon that
could not be reduced.
The data was extracted from the CUVPlus lexicon by running a lemmatising
algorithm to reduce every entry in the lexicon and checking the
resulting proposed lemmas against the lexicon.
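To make the corroboration step concrete, here is a minimal sketch in Python. The suffix rules, the toy lexicon, and the function names (`propose_lemmas`, `corroborate`) are assumptions for illustration only; the actual lemmatising algorithm and the CUVPlus file layout are not described in this post, and the sketch ignores the C7 tags entirely.

```python
# Hypothetical suffix-stripping rules: (suffix, replacement).
# Illustrative only, not the rules used to build the distributed files.
SUFFIX_RULES = [("ies", "y"), ("es", ""), ("s", ""), ("ing", ""), ("ed", ""), ("ly", "")]


def propose_lemmas(form):
    """Return candidate lemmas for a word form, in rule order."""
    candidates = []
    for suffix, replacement in SUFFIX_RULES:
        if form.endswith(suffix) and len(form) > len(suffix):
            candidates.append(form[: -len(suffix)] + replacement)
    return candidates


def corroborate(lexicon):
    """Split lexicon forms into corroborated mappings and unreduced singles."""
    mappings = {}  # word form -> lemma that was found in the lexicon
    singles = []   # forms with candidates, none of which is in the lexicon
    for form in sorted(lexicon):
        candidates = propose_lemmas(form)
        if not candidates:
            mappings[form] = form  # no rule applies: treated as a base form
            continue
        found = [c for c in candidates if c in lexicon]
        if found:
            mappings[form] = found[0]
        else:
            singles.append(form)
    return mappings, singles


# Toy lexicon (assumed data, not taken from CUVPlus).
toy_lexicon = {"walk", "walks", "walked", "pony", "ponies",
               "quick", "quickly", "early", "scissors"}
mappings, singles = corroborate(toy_lexicon)
```

In this toy run, "early" and "scissors" end up in `singles` because the proposed bases "ear" and "scissor" are not in the toy lexicon, which mirrors the adverb and plural cases mentioned for singles.txt.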
The file lemmas.txt contains inflection-lemma mappings that were
corroborated by the lexicon and inflect.txt contains the inverse
mappings. These files include words that are already in base form.
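The relationship between the two files can be sketched as a simple inversion. The sample pairs and the `invert` helper below are made up for illustration, and the tag and frequency columns are left out because the file layout is not specified in this post.

```python
from collections import defaultdict


def invert(inflection_to_lemma):
    """Build lemma -> sorted list of word forms from form -> lemma pairs."""
    lemma_to_inflections = defaultdict(list)
    for form, lemma in inflection_to_lemma.items():
        lemma_to_inflections[lemma].append(form)
    return {lemma: sorted(forms) for lemma, forms in lemma_to_inflections.items()}


# Made-up sample in the spirit of lemmas.txt; base forms map to themselves.
sample = {"walk": "walk", "walks": "walk", "walked": "walk",
          "pony": "pony", "ponies": "pony"}
inverse = invert(sample)
```

Because base forms map to themselves, each lemma also appears in its own list of forms after inversion.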
The singles.txt file contains word forms that, judging by the tag, should
be reducible but for which no proposed lemma could be found in the
lexicon. Most are adverbs that have no adjective base form; many others
are non-count plural forms. There are also some (BNC) tagging errors,
misspellings and rare word forms. I have included the BNC frequency for
each entry from the lexicon, as most of the noise is of low frequency.
Please note that, because proposed lemmas are checked against the
lexicon, words not covered by the CUVPlus lexicon do not appear in the
mappings.
All the entries in the files are tagged using the C7 tagset.
The data is a work in progress, but I believe it is pretty clean.
If you decide to use the mapping tables please cite my PhD thesis - it
is at Birkbeck College, University of London and due for submission
later this year.

Thank you,
Dave

----
Download link: http://mcs.open.ac.uk/dh5368/lemmas.zip
 
 
:) Hi, could anyone who has the lemmas file posted here share it with me? It seems that the link no longer works. Thanks to Dr. Xiao and Dave for providing the post. My email is flyingbird07@yeah.net

Have a lovely weekend

Kate
 