I wrote a Perl program, find-compounds.pl, that finds the longest compound words in a text.
It is part of the Text-NSP package. Its description is at the following link:
http://search.cpan.org/~tpederse/Text-NSP-1.21/bin/utils/find-compounds.pl
For example, suppose the original text contains "This is the new york city" and the compound word list has
new_york
new_york_city
find-compounds.pl finds the longest match, so after the compound words are replaced the text is "This is the new_york_city".
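To make the longest-match idea concrete, here is a minimal Python sketch of the behaviour described above. It is not the Text-NSP code: the function name `replace_compounds` is hypothetical, and it assumes whitespace-separated tokens, case-sensitive matching, and compounds written with underscores as in the list above.

```python
def replace_compounds(text, compounds):
    """Greedy longest-match replacement of compound words (illustrative only)."""
    # Store each compound as a tuple of its parts, e.g. ("new", "york", "city").
    compound_set = {tuple(c.split("_")) for c in compounds}
    max_len = max((len(c) for c in compound_set), default=1)

    tokens = text.split()  # assumption: tokens are separated by whitespace
    out = []
    i = 0
    while i < len(tokens):
        # Try the longest possible candidate first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 1, -1):
            if tuple(tokens[i:i + n]) in compound_set:
                out.append("_".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No compound starts at this token; copy it through unchanged.
            out.append(tokens[i])
            i += 1
    return " ".join(out)


print(replace_compounds("This is the new york city", ["new_york", "new_york_city"]))
```

Because the three-word candidate is tried before the two-word one, "new york city" is joined as a single compound rather than as "new_york" followed by "city".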
The program takes as input a list, prepared in advance, of the compound words you are interested in.
The output is the text file with the compound words replaced. To pick out the sentences
that contain compound words, you need to process the output text further. Hope this is helpful.
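The further processing could look something like the following Python sketch. It is only an illustration: the function name `sentences_with_compounds` is hypothetical, the sentence splitting is deliberately naive (it breaks after ".", "!" or "?" followed by whitespace), and it assumes the compounds already appear in the text in their underscore-joined form.

```python
import re


def sentences_with_compounds(text, compounds):
    """Return the sentences that contain at least one listed compound word."""
    compound_set = set(compounds)
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = []
    for sentence in sentences:
        # Strip surrounding punctuation so "new_york_city." still matches.
        words = {w.strip(".,;:!?\"'()") for w in sentence.split()}
        if words & compound_set:
            hits.append(sentence)
    return hits


sample = "This is the new_york_city. It rains today. I like new_york!"
for s in sentences_with_compounds(sample, ["new_york", "new_york_city"]):
    print(s)
```

For real corpora a proper sentence splitter would be a better choice than the regular expression used here.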
Thanks,
Ying
Quote from Corpora List