For <NN.*>{2,}, try:
Code:
pattern = r"""NP: {<NN.*><NN.*>+}"""
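For <NN.*>{2,5}, write one rule per length, longest first:
Code:
pattern = r"""NP: {<NN.*><NN.*><NN.*><NN.*><NN.*>}
{<NN.*><NN.*><NN.*><NN.*>}
{<NN.*><NN.*><NN.*>}
{<NN.*><NN.*>}
"""
Ugly, but it works. The longest rule goes first because the rules are applied in order and tokens already chunked by an earlier rule are not chunked again; with the two-noun rule first, three nouns in a row would come out as a pair plus a leftover noun.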
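If the range changes often, the rule list can be generated instead of typed out. The following is a minimal sketch, not part of NLTK: `build_np_grammar` and its parameters are hypothetical names, and it emits the rules longest first on the assumption that the chunker tries rules in order and skips tokens that are already inside a chunk.

```python
def build_np_grammar(min_len=2, max_len=5, label="NP", tag="<NN.*>"):
    """Return a chunk grammar string with one rule per run length.

    Hypothetical helper: emulates <tag>{min_len,max_len} by spelling out
    each length as its own rule, longest first.
    """
    if not 1 <= min_len <= max_len:
        raise ValueError("need 1 <= min_len <= max_len")
    # One rule per length, e.g. {<NN.*><NN.*><NN.*>} for length 3.
    rules = ["{" + tag * n + "}" for n in range(max_len, min_len - 1, -1)]
    return label + ": " + "\n".join(rules) + "\n"


pattern = build_np_grammar(2, 5)
print(pattern)
```

The resulting string can be passed to nltk.RegexpParser in place of the hand-written pattern.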
Run the following code:
Code:
import nltk
sent = nltk.word_tokenize("Again, it depends on whether the UK government decides to introduce a work permit system of the kind that currently applies to non-EU citizens, limiting entry to skilled workers in professions where there are shortages.")
tagged_sent = nltk.pos_tag(sent)
pattern = r"""NP: {<NN.*><NN.*>+}"""
cp = nltk.RegexpParser(pattern)
print(cp.parse(tagged_sent))
You get:
Code:
(S
  Again/RB
  ,/,
  it/PRP
  depends/VBZ
  on/IN
  whether/IN
  the/DT
  (NP UK/NNP government/NN)
  decides/VBZ
  to/TO
  introduce/VB
  a/DT
  (NP work/NN permit/NN system/NN)
  of/IN
  the/DT
  kind/NN
  that/IN
  currently/RB
  applies/VBZ
  to/TO
  (NP non-EU/NNP citizens/NNS)
  ,/,
  limiting/VBG
  entry/NN
  to/TO
  skilled/JJ
  workers/NNS
  in/IN
  professions/NNS
  where/WRB
  there/EX
  are/VBP
  shortages/NNS
  ./.)
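As a sanity check on the result, the same grouping can be reproduced without a chunk grammar. This is only a sketch: `noun_runs` is a hypothetical helper, the tagged tokens are an excerpt copied by hand from the output above, and a plain "starts with NN" test stands in for the <NN.*> tag pattern.

```python
from itertools import groupby

# Excerpt of the tagged sentence, copied from the parser output.
tagged = [
    ("the", "DT"), ("UK", "NNP"), ("government", "NN"), ("decides", "VBZ"),
    ("to", "TO"), ("introduce", "VB"), ("a", "DT"), ("work", "NN"),
    ("permit", "NN"), ("system", "NN"), ("of", "IN"), ("the", "DT"),
    ("kind", "NN"),
]


def noun_runs(tagged_tokens, min_len=2):
    """Collect runs of at least min_len consecutive noun-tagged words."""
    runs = []
    # groupby splits the sentence into alternating noun / non-noun stretches.
    for is_noun, group in groupby(tagged_tokens, key=lambda wt: wt[1].startswith("NN")):
        words = [word for word, _ in group]
        if is_noun and len(words) >= min_len:
            runs.append(words)
    return runs


runs = noun_runs(tagged)
print(runs)  # [['UK', 'government'], ['work', 'permit', 'system']]
```

The single noun "kind" is dropped because it is shorter than min_len, which matches the parse above, where kind/NN stays outside any NP chunk.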