Extracting specific leaf value from nltk tree structure with Python

I have some questions about NLTK's tree functions. I am trying to extract a certain word from a tree structure like the one given below.

from nltk.tree import Tree

test = Tree.fromstring('(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))(PP (IN on)(NP (DT a)(NN date)))))))')

print("Input tree:", test)
print(test.leaves())

Input tree: (ROOT
  (SBARQ
    (WHADVP (WRB How))
    (SQ
      (VBP do)
      (NP (PRP you))
      (VP
        (VB ask)
        (NP (DT a) (JJ total) (NN stranger))
        (PRT (RP out))
        (PP (IN on) (NP (DT a) (NN date)))))))

['How', 'do', 'you', 'ask', 'a', 'total', 'stranger', 'out', 'on', 'a', 'date']

I can get a list of all the words using the leaves() function. Is there a way to get only a specific leaf? For example, I would like to get the first/last noun from the NP phrases only. The answer would be 'stranger' for the first noun and 'date' for the last noun.

Although noun phrases can be nested inside other types of phrases, I believe most grammars always place nouns inside noun phrases. So your question can probably be rephrased as: how do you find the first and last nouns?

You can simply get all the tuples of words and POS tags and filter them like this:

>>> [word for word,pos in test.pos() if pos=='NN']
['stranger', 'date']

In this case there are only two nouns, so you're done. If there were more, you would just index the list at [0] and [-1].
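The filtering and indexing described above can be wrapped up as follows. This is only a sketch: the function name first_and_last_noun is made up for illustration, it assumes NLTK 3 (Tree.fromstring), and it assumes Penn Treebank tags, where every noun tag (NN, NNS, NNP, NNPS) starts with 'NN'.

```python
from nltk.tree import Tree

test = Tree.fromstring(
    '(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))'
    '(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))'
    '(PP (IN on)(NP (DT a)(NN date)))))))'
)


def first_and_last_noun(tree):
    """Return (first, last) noun of the tree, or (None, None) if there is none.

    Hypothetical helper: treats any tag starting with 'NN' as a noun,
    which is an assumption about the tag set, not something NLTK enforces.
    """
    nouns = [word for word, tag in tree.pos() if tag.startswith('NN')]
    if not nouns:
        return None, None
    return nouns[0], nouns[-1]


print(first_and_last_noun(test))
```

Returning a pair of None values for a tree without nouns avoids the IndexError that plain [0] and [-1] indexing would raise on an empty list.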


If you were looking for another POS that can appear in different kinds of phrases but you only wanted its occurrences inside a particular one, or if you had a strange grammar that allowed nouns outside of NPs, you can do the following.

You can find the 'NP' subtrees with:

>>> NPs = list(test.subtrees(filter=lambda x: x.label() == 'NP'))
>>> NPs
[Tree('NP', [Tree('PRP', ['you'])]), Tree('NP', [Tree('DT', ['a']), Tree('JJ', ['total']), Tree('NN', ['stranger'])]), Tree('NP', [Tree('DT', ['a']), Tree('NN', ['date'])])]
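To check which phrases were matched, it can help to print each subtree's words instead of its nested repr. A small sketch, assuming NLTK 3, where a subtree's label is read with label() rather than the older node attribute; the name np_phrases is introduced here only for illustration.

```python
from nltk.tree import Tree

test = Tree.fromstring(
    '(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))'
    '(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))'
    '(PP (IN on)(NP (DT a)(NN date)))))))'
)

# subtrees() walks the tree in pre-order, so the NPs come out in sentence order.
NPs = list(test.subtrees(filter=lambda t: t.label() == 'NP'))

# Join the leaves of each NP into a readable phrase.
np_phrases = [' '.join(np.leaves()) for np in NPs]
print(np_phrases)  # ['you', 'a total stranger', 'a date']
```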

Continuing to narrow down the subtrees, we can use this result to look for the 'NN' words:

>>> NNs_inside_NPs = [list(np.subtrees(filter=lambda x: x.label() == 'NN')) for np in NPs]
>>> NNs_inside_NPs
[[], [Tree('NN', ['stranger'])], [Tree('NN', ['date'])]]

So this is a list of lists of all the 'NN's inside each 'NP' phrase. In this case each phrase happens to contain either zero or one noun.

Now we just need to go through the 'NP's and get the leaves of the individual nouns (which really means we just want to access the 'stranger' part of Tree('NN', ['stranger'])).

>>> [noun.leaves()[0] for nouns in NNs_inside_NPs for noun in nouns]
['stranger', 'date']
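One caveat with the subtree-by-subtree approach: if an NP is nested inside another NP, its nouns are collected once per enclosing NP and so show up more than once. A possible alternative, sketched here under the assumption of NLTK 3, walks the leaves once and checks each leaf's ancestors instead. The function name words_with_tag_under is hypothetical.

```python
from nltk.tree import Tree


def words_with_tag_under(tree, tag, phrase):
    """Return words tagged `tag` that have an ancestor labelled `phrase`.

    Hypothetical helper: each leaf is visited exactly once, so nested
    phrases do not produce duplicate words.
    """
    words = []
    for i, (word, pos) in enumerate(tree.pos()):
        if pos != tag:
            continue
        path = tree.leaf_treeposition(i)
        # path[:k] for k < len(path) - 1 addresses every proper ancestor of
        # the preterminal (the root is the empty tuple).
        if any(tree[path[:k]].label() == phrase for k in range(len(path) - 1)):
            words.append(word)
    return words


test = Tree.fromstring(
    '(ROOT(SBARQ(WHADVP(WRB How))(SQ(VBP do)(NP (PRP you))'
    '(VP(VB ask)(NP(DT a)(JJ total)(NN stranger))(PRT (RP out))'
    '(PP (IN on)(NP (DT a)(NN date)))))))'
)
nested = Tree.fromstring(
    '(NP (NP (DT the) (NN cat)) (PP (IN on) (NP (DT the) (NN mat))))'
)

print(words_with_tag_under(test, 'NN', 'NP'))
print(words_with_tag_under(nested, 'NN', 'NP'))
```

On the question's tree this gives the same two nouns as the steps above; on the nested example each noun appears only once.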
