How to add compound words to the tagger in NLTK?

Question

So, I was wondering if anyone had any idea how to combine multiple terms into a single term in the taggers in NLTK.

For example, when I do:

nltk.pos_tag(nltk.word_tokenize('Apple Incorporated is the largest company'))

It gives me:

[('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]

How do I make it put 'Apple' and 'Incorporated' together as ('Apple Incorporated', 'NNP')?

Answer 1

You could try taking a look at nltk.RegexpParser. It lets you chunk part-of-speech-tagged tokens based on regular expressions over the tags. In your example, you could do something like:

import nltk

# Chunk one or more consecutive noun tags into a single NP
pattern = "NP: {<NN|NNP|NNS|NNPS>+}"
c = nltk.RegexpParser(pattern)
t = c.parse(nltk.pos_tag(nltk.word_tokenize("Apple Incorporated is the largest company")))
print(repr(t))

This would give you:

Tree('S', [Tree('NP', [('Apple', 'NNP'), ('Incorporated', 'NNP')]), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), Tree('NP', [('company', 'NN')])])
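If what you want in the end is a flat list with the compound as a single entry, one option is to merge runs of proper-noun tags yourself after tagging. The following is only a sketch: merge_proper_nouns is a hypothetical helper (not part of NLTK), it works on plain (word, tag) tuples such as the output of nltk.pos_tag, and it assumes that every run of adjacent NNP/NNPS tokens is one name, which will not always be true.

```python
def merge_proper_nouns(tagged, tags=("NNP", "NNPS")):
    """Join runs of consecutive proper-noun tokens into one (word, tag) pair.

    Hypothetical helper, not an NLTK function. The merged entry keeps the
    tag of the last token in the run.
    """
    merged = []
    run = []  # proper-noun tokens collected so far

    def flush():
        # Emit the collected run (if any) as a single compound token.
        if run:
            merged.append((" ".join(word for word, _ in run), run[-1][1]))
            run.clear()

    for word, tag in tagged:
        if tag in tags:
            run.append((word, tag))
        else:
            flush()
            merged.append((word, tag))
    flush()  # handle a run at the very end of the sentence
    return merged


# The tagged output from the question, written out by hand
tagged = [('Apple', 'NNP'), ('Incorporated', 'NNP'), ('is', 'VBZ'),
          ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
print(merge_proper_nouns(tagged))
# [('Apple Incorporated', 'NNP'), ('is', 'VBZ'), ('the', 'DT'), ('largest', 'JJS'), ('company', 'NN')]
```

The same idea could be applied to the NP subtrees produced by the chunker instead of to raw tags, if you want the grammar to decide what counts as a compound.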

Answer 2

The code is doing exactly what it is supposed to do. It is adding Part Of Speech tags to tokens. 'Apple Incorporated' is not a single token. It is two separate tokens, and as such can't have a single POS tag applied to it. This is the correct behaviour.

I wonder if you are trying to use the wrong tool for the job. What are you trying to do, and why? Perhaps you are interested in identifying collocations rather than POS tagging? You might have a look at NLTK's collocations module (nltk.collocations).
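If you already know which compounds you care about, another way to act on the "it is two separate tokens" point is to make the compound a single token before tagging. This is a minimal sketch using nltk.tokenize.MWETokenizer, which is not mentioned above; the list of multi-word expressions is an assumption you would have to supply yourself, and the sentence is split on whitespace only to keep the example free of downloaded models.

```python
from nltk.tokenize import MWETokenizer

# Known multi-word expressions, each given as a tuple of its tokens.
# This list is illustrative; in practice you would build your own.
mwe_tokenizer = MWETokenizer([("Apple", "Incorporated")], separator=" ")

# Plain whitespace split stands in for nltk.word_tokenize here.
tokens = "Apple Incorporated is the largest company".split()
merged_tokens = mwe_tokenizer.tokenize(tokens)
print(merged_tokens)
```

The merged list can then be passed to nltk.pos_tag. Keep in mind that the tagger was not trained on tokens containing spaces, so the tag it assigns to the compound is not guaranteed to be the one you expect.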
