Python locate words in nltk.tree

I am trying to use NLTK to get the context of words. I have two sentences:

import pandas as pd

sentences = pd.DataFrame({"sentence": ["The weather was good so I went swimming", "Because of the good food we took desert"]})

I would like to find out what the word "good" refers to. My idea is to chunk the sentences (using code from a tutorial) and then check whether the word "good" and a noun are in the same node. If not, it refers to a noun before or after that node.

First I build the chunker as in the tutorial:

import nltk
from nltk.corpus import conll2000

# NP-chunked sentences from the CoNLL-2000 corpus
test_sents = conll2000.chunked_sents('test.txt', chunk_types=['NP'])
train_sents = conll2000.chunked_sents('train.txt', chunk_types=['NP'])

class ChunkParser(nltk.ChunkParserI):
    def __init__(self, train_sents):
        # Train a tagger that maps sequences of POS tags to IOB chunk tags
        train_data = [[(t, c) for w, t, c in nltk.chunk.tree2conlltags(sent)]
                      for sent in train_sents]
        self.tagger = nltk.TrigramTagger(train_data)

    def parse(self, sentence):
        # sentence is a list of (word, pos) pairs
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        chunktags = [chunktag for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag)
                     in zip(sentence, chunktags)]
        return nltk.chunk.conlltags2tree(conlltags)

NPChunker = ChunkParser(train_sents)

Then I apply this to my sentences:

sentence = sentences["sentence"][0]
tags = nltk.pos_tag(sentence.lower().split())
result = NPChunker.parse(tags)
print(result)

The result looks like this:

(S
  (NP the/DT weather/NN)
  was/VBD
  (NP good/JJ)
  so/RB
  (NP i/JJ)
  went/VBD
  swimming/VBG)

Now I would like to find which node the word "good" is in. I have not figured out a better way than counting the words in the nodes and in the leaves. The word "good" is word number 3 in the sentence (counting from 0).

stuctured_sentence = []
for n in range(len(result)):
    stuctured_sentence.append(list(result[n]))

structure_length = []
for n in result:
    if isinstance(n, nltk.tree.Tree):
        if n.label() == 'NP':
            print(n)
            # a chunk contributes as many words as it has leaves
            structure_length.append(len(n))
    else:
        print(str(n) + " is a leaf")
        structure_length.append(1)

By summing up the number of words, I know where the word "good" is.

structure_frame=pd.DataFrame({"structure": stuctured_sentence, "length": structure_length})
structure_frame["cumsum"]=structure_frame["length"].cumsum()
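
For comparison, the same counting idea can be sketched without pandas. This is only an illustration under assumptions: the chunked sentence is written out by hand as plain lists and tuples rather than taken from the chunker, and `chunks` and `node_of_word` are hypothetical names.

```python
from bisect import bisect_right
from itertools import accumulate

# Hand-written stand-in for the chunker output shown above:
# a list is a chunk (node), a bare tuple is a leaf.
chunks = [
    [("the", "DT"), ("weather", "NN")],
    ("was", "VBD"),
    [("good", "JJ")],
    ("so", "RB"),
    [("i", "JJ")],
    ("went", "VBD"),
    ("swimming", "VBG"),
]

# Number of words under each top-level element, then the running total.
lengths = [len(c) if isinstance(c, list) else 1 for c in chunks]
cumsum = list(accumulate(lengths))


def node_of_word(word_index):
    """Return the index of the top-level element that holds the word
    at 0-based position word_index in the flat sentence."""
    return bisect_right(cumsum, word_index)


# "good" is word 3 (0-based); look up the element that contains it.
print(chunks[node_of_word(3)])
```

The running total marks where each element ends, so a binary search over it maps a word position back to its element.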

Is there an easier way to determine the node or leaf a word is in, and to find out which word "good" refers to?

Best, Alex

It's easiest to find your word in a list of leaves. You can then translate the leaf index into a tree index, which is a path down the tree. To see what is grouped with good, go up one level and examine the subtree that this picks out.

First, find the position of good in your flat sentence. (You could skip this if you still had the untagged sentence as a list of tokens.)

words = [ w for w, t in result.leaves() ]

Now we find the linear position of good and translate it into a tree path:

>>> position = words.index("good")
>>> treeposition = result.leaf_treeposition(position)
>>> print(treeposition)
(2, 0)
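
Note that `words.index` only returns the first occurrence. If the word can appear more than once, something along these lines should collect every tree position. The tree here is built by hand to mirror the chunker output from the question, and `treepositions_of` is a hypothetical helper name, not part of NLTK.

```python
from nltk.tree import Tree

# Hand-built copy of the chunked sentence from the question,
# so the example runs without training the chunker.
result = Tree("S", [
    Tree("NP", [("the", "DT"), ("weather", "NN")]),
    ("was", "VBD"),
    Tree("NP", [("good", "JJ")]),
    ("so", "RB"),
    Tree("NP", [("i", "JJ")]),
    ("went", "VBD"),
    ("swimming", "VBG"),
])


def treepositions_of(tree, word):
    """Return the tree position of every leaf whose token equals word."""
    return [
        tree.leaf_treeposition(i)
        for i, (token, tag) in enumerate(tree.leaves())
        if token == word
    ]


print(treepositions_of(result, "good"))
```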

A "treeposition" is a path down the tree, expressed as a tuple. (NLTK trees can be indexed with tuples as well as integers.) To see the sisters of good, stop one step before you get to the end of the path.

>>> result[treeposition[:-1]]
Tree('NP', [('good', 'JJ')])
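
Two details may be worth checking on a hand-built copy of the tree (an assumption of this sketch, so that it runs without the chunker). Indexing with a tuple should be equivalent to chained integer indexing, and for a word that is not inside any chunk the shortened path is empty, which selects the whole sentence.

```python
from nltk.tree import Tree

# Hand-built copy of the chunked sentence from the question.
result = Tree("S", [
    Tree("NP", [("the", "DT"), ("weather", "NN")]),
    ("was", "VBD"),
    Tree("NP", [("good", "JJ")]),
    ("so", "RB"),
    Tree("NP", [("i", "JJ")]),
    ("went", "VBD"),
    ("swimming", "VBG"),
])

# Tuple indexing walks the path one step at a time.
same_leaf = result[(2, 0)] == result[2][0]

# "was" is leaf number 2 and a direct child of S, so its parent path
# is the empty tuple, and indexing with () gives back the root.
was_position = result.leaf_treeposition(2)
parent = result[was_position[:-1]]

print(same_leaf, was_position, parent.label())
```

So a parent labelled S rather than NP means the word is not grouped with anything.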

There you are: a subtree with one leaf, the pair ('good', 'JJ').
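
To get from there to the question of which noun good refers to, one possible heuristic is to take the nouns in the same chunk and, if there are none, fall back to the nearest noun elsewhere in the sentence. This is a sketch under assumptions, not something NLTK provides: `nouns_for` is a hypothetical helper, and both trees are written by hand (the second is an invented chunking of the other sentence, not actual chunker output).

```python
from nltk.tree import Tree


def nouns_for(tree, word):
    """Guess which noun(s) word refers to in a chunked sentence."""
    leaves = tree.leaves()
    index = [token for token, tag in leaves].index(word)
    parent_path = tree.leaf_treeposition(index)[:-1]

    # Nouns that share a chunk with the word (skipped if the parent is the root).
    if parent_path:
        chunk = tree[parent_path]
        nouns = [tok for tok, tag in chunk.leaves() if tag.startswith("NN")]
        if nouns:
            return nouns

    # Fallback: the closest noun by distance in the flat sentence.
    candidates = [
        (abs(i - index), tok)
        for i, (tok, tag) in enumerate(leaves)
        if tag.startswith("NN")
    ]
    return [min(candidates)[1]] if candidates else []


# Hand-built trees (illustrative only).
first = Tree("S", [
    Tree("NP", [("the", "DT"), ("weather", "NN")]),
    ("was", "VBD"),
    Tree("NP", [("good", "JJ")]),
    ("so", "RB"),
    Tree("NP", [("i", "JJ")]),
    ("went", "VBD"),
    ("swimming", "VBG"),
])
second = Tree("S", [
    ("because", "IN"),
    ("of", "IN"),
    Tree("NP", [("the", "DT"), ("good", "JJ"), ("food", "NN")]),
    Tree("NP", [("we", "PRP")]),
    ("took", "VBD"),
    Tree("NP", [("desert", "NN")]),
])

print(nouns_for(first, "good"), nouns_for(second, "good"))
```

The distance fallback is crude; it ignores sentence structure and will pick the wrong noun in many sentences.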
