
Python NLTK parse tagged text: how to retrieve the tagged text

I'm new to NLTK and I'd like to experiment with grammar parsers for a toy project of mine.

Here is the code I use:

import nltk
from nltk import CFG

tokens = nltk.regexp_tokenize(test_sentence, ptrn_for_tokenizer, flags=flags)
tagged_text = regexp_tagger.tag(tokens)
only_tags = [tag for text, tag in tagged_text]
grammar = CFG.fromstring(GRAMMAR)
parser = nltk.ChartParser(grammar, trace=0)
trees = parser.parse(only_tags)
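
The snippet above leaves test_sentence, ptrn_for_tokenizer, flags, regexp_tagger and GRAMMAR undefined. As a minimal runnable sketch of the same pipeline, the version below fills them in with hypothetical values (a four-word sentence, a `\w+` tokenizer pattern, one tagger regex per word, and a tiny grammar over the tags). None of these values come from the original post; they are assumptions chosen only so the parse succeeds.

```python
import re

import nltk
from nltk import CFG

# Hypothetical stand-ins for the names the question does not define.
test_sentence = "the big dog runs"
ptrn_for_tokenizer = r"\w+"
flags = re.UNICODE

# Assumed tagger: one regex per word, purely for illustration.
regexp_tagger = nltk.RegexpTagger([
    (r"^the$", "D"),
    (r"^big$", "A"),
    (r"^dog$", "N"),
    (r"^runs$", "V"),
])

# Assumed grammar whose terminals are the POS tags, not the words.
GRAMMAR = """
S -> DP VP
DP -> 'D' AP
AP -> 'A' 'N'
VP -> 'V'
"""

tokens = nltk.regexp_tokenize(test_sentence, ptrn_for_tokenizer, flags=flags)
tagged_text = regexp_tagger.tag(tokens)
only_tags = [tag for text, tag in tagged_text]
grammar = CFG.fromstring(GRAMMAR)
parser = nltk.ChartParser(grammar, trace=0)

# parse() returns an iterator; materialize it so the trees can be reused.
trees = list(parser.parse(only_tags))
print(trees[0])
```

With these assumed inputs the leaves of the resulting tree are the tags themselves, which is exactly the situation described: the words are no longer in the tree.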

So I tokenize the text with regexes, then tag the tokens using another set of regexes, and finally use the parser to get the syntax trees. But the parse is done using only the tags (only_tags), and I cannot recover the tagged text.

How can I do this? Is this the wrong approach?

I understand your motivation for writing a grammar over just the POS tags: NLTK's rule-based parsers don't have a place for a large vocabulary, since they're instructional tools not intended for real use. I'm not too sure what your parse trees look like, but if the POS tags are the leaf nodes, you can edit the tree and drop the words back in.

I'll first hand-code a sample tree similar to what your parser might give you:

import nltk
mytree = nltk.Tree.fromstring("(S (DP D (AP A N)) (VP V))")

So here's how to put the words back in:

>>> tokens = "the big dog runs".split()
>>> for n, pos in enumerate(mytree.leaves()):
...     mytree[mytree.leaf_treeposition(n)] = nltk.Tree(pos, [tokens[n]])
...
>>> print(mytree)
(S (DP (D the) (AP (A big) (N dog))) (VP (V runs)))
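
The same idea can be wrapped in a small helper and applied to each tree the parser yields. The sketch below is one possible way to do that, not part of NLTK: the function name reattach_words, the sample tagged_text and the grammar are all hypothetical. Unlike the loop above, it works on a deep copy, so the tag-only tree is left untouched.

```python
import nltk
from nltk import CFG


def reattach_words(tree, tokens):
    """Return a copy of `tree` in which each POS-tag leaf becomes (TAG word)."""
    tags = tree.leaves()
    if len(tags) != len(tokens):
        raise ValueError("tree has a different number of leaves than there are tokens")
    result = tree.copy(deep=True)
    for n, pos in enumerate(tags):
        # Replacing a leaf with a one-leaf subtree keeps leaf order and count,
        # so the n-th leaf position stays valid as the loop advances.
        result[result.leaf_treeposition(n)] = nltk.Tree(pos, [tokens[n]])
    return result


# Hypothetical tagged input and grammar over the tags, for illustration only.
tagged_text = [("the", "D"), ("big", "A"), ("dog", "N"), ("runs", "V")]
grammar = CFG.fromstring("""
S -> DP VP
DP -> 'D' AP
AP -> 'A' 'N'
VP -> 'V'
""")
parser = nltk.ChartParser(grammar)

tokens = [word for word, tag in tagged_text]
only_tags = [tag for word, tag in tagged_text]

for tree in parser.parse(only_tags):
    print(reattach_words(tree, tokens))
```

This relies on the tree's leaves being in the same order as the tokens that were tagged, which holds when the parser is given the tags in sentence order.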
