Python NLTK parse tagged text: how to retrieve the tagged text

Question

I'm new to NLTK and I'd like to experiment with grammar parsers for a toy project.

Here is the code I use:

import nltk
from nltk import CFG

# test_sentence, ptrn_for_tokenizer, flags, regexp_tagger and GRAMMAR are defined elsewhere
tokens = nltk.regexp_tokenize(test_sentence, ptrn_for_tokenizer, flags=flags)
tagged_text = regexp_tagger.tag(tokens)
only_tags = [tag for text, tag in tagged_text]
grammar = CFG.fromstring(GRAMMAR)
parser = nltk.ChartParser(grammar, trace=0)
trees = parser.parse(only_tags)

So I tokenize the text with regexes, tag the tokens with other regexes, and finally use the parser to get the syntax trees. But the parse is done only with the tags (only_tags), so the resulting trees contain the tags and I cannot recover the tagged text from them.

How can I do this? Is this the wrong approach?

Answer 1

I understand your motivation for writing a grammar over just the POS tags: NLTK's rule-based parsers don't have a place for a large vocabulary, since they're instructional tools not intended for real use. I'm not sure what your parse trees look like, but if the POS tags are the leaf nodes, you can edit the tree and drop the words back in.

I'll first hand-code a sample tree similar to what your parser might give you:

mytree = nltk.Tree.fromstring("(S (DP D (AP A N)) (VP V))")

So here's how to put the words back in:

>>> tokens = "the big dog runs".split()
>>> for n, pos in enumerate(mytree.leaves()):
...     mytree[mytree.leaf_treeposition(n)] = nltk.Tree(pos, [tokens[n]])
...
>>> print(mytree)
(S (DP (D the) (AP (A big) (N dog))) (VP (V runs)))
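
To connect this back to the pipeline in the question, here is a rough end-to-end sketch. The toy grammar, the hand-written tagged_text and the helper name reattach_words are all made up for illustration; your own regex tokenizer, tagger and GRAMMAR would take their place. It parses the tag sequence, then replaces each tag leaf with a (TAG word) subtree, working on a copy so the original parse is left alone:

```python
import nltk

# Hypothetical toy grammar whose terminals are POS tags, not words.
TAG_GRAMMAR = nltk.CFG.fromstring("""
    S -> DP VP
    DP -> 'D' AP
    AP -> 'A' 'N'
    VP -> 'V'
""")


def reattach_words(tree, words):
    """Return a copy of `tree` in which each POS-tag leaf becomes (TAG word)."""
    result = tree.copy(deep=True)  # leave the parser's tree untouched
    tags = result.leaves()
    if len(tags) != len(words):
        raise ValueError("need exactly one word per leaf")
    for n, (tag, word) in enumerate(zip(tags, words)):
        # Each replacement subtree has exactly one leaf, so leaf indices stay stable.
        result[result.leaf_treeposition(n)] = nltk.Tree(tag, [word])
    return result


# Stand-in for the output of regexp_tagger.tag(tokens): (text, tag) pairs.
tagged_text = [("the", "D"), ("big", "A"), ("dog", "N"), ("runs", "V")]
words = [text for text, tag in tagged_text]
only_tags = [tag for text, tag in tagged_text]

parser = nltk.ChartParser(TAG_GRAMMAR)
word_trees = [reattach_words(t, words) for t in parser.parse(only_tags)]
for t in word_trees:
    print(t)
```

This relies on the leaves of each parse appearing in the same order as the input tags, which holds for a chart parse of the full tag sequence. If the grammar is ambiguous, every parse gets the same words reattached.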

solution1
2 ACCEPTED 2015-10-16 21:46:59
