Python NLTK parse tagged text: how to retrieve the tagged text

I'm new to NLTK and I'd like to experiment with grammar parsers for a toy project of mine.

Here is the code I use:

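import nltk
from nltk import CFG

# test_sentence, ptrn_for_tokenizer, flags, regexp_tagger and GRAMMAR are defined elsewhere
tokens = nltk.regexp_tokenize(test_sentence, ptrn_for_tokenizer, flags=flags)
tagged_text = regexp_tagger.tag(tokens)          # list of (text, tag) pairs
only_tags = [tag for text, tag in tagged_text]   # the words are dropped here
grammar = CFG.fromstring(GRAMMAR)
parser = nltk.ChartParser(grammar, trace=0)
trees = parser.parse(only_tags)                  # parses the tag sequence only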

So I tokenize the text with regular expressions, tag the tokens with another set of regular expressions, and finally use the parser to build the syntax trees. But the parse is done only on the tags (only_tags), so I cannot recover the tagged text from the trees.

How can I do this? Is this the wrong approach?
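
For reference, the pipeline can be reproduced end to end with small stand-in values. Everything marked hypothetical below (the sentence, the tokenizer pattern, the tagger rules and the toy grammar) is an assumption made for illustration, not the original setup. It only shows that the leaves of the resulting tree are the tags, not the words:

```python
import re

import nltk
from nltk import CFG

# Hypothetical stand-ins for the names the question leaves undefined.
test_sentence = "the big dog runs"
ptrn_for_tokenizer = r"\w+"
flags = re.UNICODE
regexp_tagger = nltk.RegexpTagger([
    (r"^the$", "D"),
    (r"^big$", "A"),
    (r"^dog$", "N"),
    (r"^runs$", "V"),
])
# Toy grammar whose terminals are POS tags rather than words.
GRAMMAR = """
S -> DP VP
DP -> 'D' AP
AP -> 'A' 'N'
VP -> 'V'
"""

tokens = nltk.regexp_tokenize(test_sentence, ptrn_for_tokenizer, flags=flags)
tagged_text = regexp_tagger.tag(tokens)
only_tags = [tag for text, tag in tagged_text]

parser = nltk.ChartParser(CFG.fromstring(GRAMMAR), trace=0)
trees = list(parser.parse(only_tags))  # parse() returns an iterator

print(trees[0])           # the tree is built over tags only
print(trees[0].leaves())  # the leaves are the tags, the words are gone
```

The words are still available in tagged_text, in the same order as the leaves, which is what makes it possible to put them back afterwards.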

I understand your motivation for writing a grammar over just the POS tags: NLTK's rule-based parsers have no place for a large vocabulary, since they are instructional tools not intended for real use. I'm not too sure what your parse trees look like, but if the POS tags are the leaf nodes, you can edit the tree and drop the words back in.

I'll first hand-code a sample tree similar to what your parser might give you:

mytree = nltk.Tree.fromstring("(S (DP D (AP A N)) (VP V))")

So here's how to put the words back in:

>>> tokens = "the big dog runs".split()
>>> for n, pos in enumerate(mytree.leaves()):
...     mytree[mytree.leaf_treeposition(n)] = nltk.Tree(pos, [tokens[n]])
...
>>> print(mytree)
(S (DP (D the) (AP (A big) (N dog))) (VP (V runs)))
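
The same idea can be wrapped in a small helper and applied to every tree the parser returns. This is only a sketch: the name reattach_words, the toy grammar and the sample tagged_text are made up for illustration, and it assumes the parser preserves the order of the tags so that leaf n corresponds to token n. Unlike the loop above, it works on a copy and leaves the original tree untouched:

```python
import nltk
from nltk import CFG

# Hypothetical toy grammar over POS tags (an assumption, not the asker's grammar).
GRAMMAR = """
S -> DP VP
DP -> 'D' AP
AP -> 'A' 'N'
VP -> 'V'
"""

# Hypothetical tagger output: (word, tag) pairs.
tagged_text = [("the", "D"), ("big", "A"), ("dog", "N"), ("runs", "V")]


def reattach_words(tree, tagged):
    """Return a copy of `tree` where each tag leaf becomes a (tag word) subtree."""
    tags = tree.leaves()
    if tags != [tag for _, tag in tagged]:
        raise ValueError("tree leaves do not match the tag sequence")
    result = tree.copy(deep=True)
    for n, (word, tag) in enumerate(tagged):
        # Positions are taken from the original tree, which is never modified.
        result[tree.leaf_treeposition(n)] = nltk.Tree(tag, [word])
    return result


only_tags = [tag for _, tag in tagged_text]
parser = nltk.ChartParser(CFG.fromstring(GRAMMAR))
trees = [reattach_words(t, tagged_text) for t in parser.parse(only_tags)]

for t in trees:
    print(t)
```

With an ambiguous grammar the parser may yield several trees; the helper is applied to each of them independently.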
