
SyntaxNet creating tree to root verb

I am new to Python and the world of NLP. The recent announcement of Google's SyntaxNet intrigued me. However, I am having a lot of trouble understanding the documentation for both SyntaxNet and related tools (NLTK, etc.).

My goal: given an input such as "Wilbur kicked the ball", I would like to extract the root verb ("kicked") and the object it pertains to ("the ball").

I stumbled across "spacy.io", and this visualization seems to encapsulate what I am trying to accomplish: POS-tag a string and load it into some sort of tree structure so that I can start at the root verb and traverse the sentence.

I played around with syntaxnet/demo.sh and, as suggested in this thread, commented out the last couple of lines to get CoNLL output.

I then loaded this output into a Python script (kludged together myself, probably not correct):

import nltk
from nltk.corpus import ConllCorpusReader
columntypes = ['ignore', 'words', 'ignore', 'ignore', 'pos']
corp = ConllCorpusReader('/Users/dgourlay/development/nlp','input.conll', columntypes)

I see that I have access to corp.tagged_words(), but it gives me no relationships between the words. Now I am stuck! How can I load this corpus into a tree-type structure?

Any help is much appreciated!

This may have been better as a comment, but I don't yet have the required reputation.

I haven't used the ConllCorpusReader before (would you consider uploading the file you are loading to a gist and providing a link? It would be much easier to test), but I wrote a blog post which may help with the tree parsing aspect: here.

In particular, you probably want to chunk each sentence. Chapter 7 of the NLTK book has some more information on this, but this is the example from my blog:

# This grammar is described in the paper by S. N. Kim,
# T. Baldwin, and M.-Y. Kan.
# Evaluating n-gram based evaluation metrics for automatic
# keyphrase extraction.
# Technical report, University of Melbourne, Melbourne 2010.
grammar = r"""
NBAR:
  # Nouns and Adjectives, terminated with Nouns
  {<NN.*|JJ>*<NN.*>}

NP:
  {<NBAR>}
    # Above, connected with in/of/etc...
  {<NBAR><IN><NBAR>}
"""

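chunker = nltk.RegexpParser(grammar)
# postoks: a list of (word, POS tag) pairs for one sentence,
# e.g. the result of nltk.pos_tag(tokens)
tree = chunker.parse(postoks)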

Note: You could also use a context-free grammar (covered in Chapter 8).

Each chunked (or parsed) sentence (or, in this example, noun phrase, according to the grammar above) will be a subtree. To access these subtrees, we can use this function:

def leaves(tree):
  """Finds NP (nounphrase) leaf nodes of a chunk tree."""
  # NLTK 3 replaced the old .node attribute with the .label() method
  for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
    yield subtree.leaves()

Each of the yielded objects will be a list of word-tag pairs. From there you can find the verb.

Next, you could play with the grammar above or the parser. Verbs split noun phrases (see this diagram in Chapter 7), so you can probably just access the first NP after a VBD.
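A rough sketch of the "first NP after a VBD" idea follows. It walks the top-level children of a chunked sentence, where plain tokens are (word, tag) pairs and chunks expose label() and leaves() the way nltk.Tree does. The Chunk class and the function name first_np_after_verb are made up for illustration; Chunk is only a stand-in so the snippet runs without NLTK, and the sample sentence is built by hand to resemble what the grammar above might produce, not taken from real chunker output.

```python
class Chunk:
    """Hypothetical stand-in for nltk.Tree: just a label plus its leaves."""

    def __init__(self, label, leaves):
        self._label = label
        self._leaves = list(leaves)

    def label(self):
        return self._label

    def leaves(self):
        return list(self._leaves)


def first_np_after_verb(children, verb_tag="VBD"):
    """Return (verb, leaves of the first NP chunk that follows it).

    `children` is an iterable of top-level items: (word, tag) tuples for
    unchunked tokens, or chunk objects with label() and leaves().
    Returns (verb, None) if no NP follows, and (None, None) if no verb is found.
    """
    verb = None
    for child in children:
        if hasattr(child, "label"):
            # A chunk: only interesting once we have passed the verb.
            if verb is not None and child.label() == "NP":
                return verb, child.leaves()
        elif verb is None and child[1] == verb_tag:
            # A plain token carrying the verb tag we are looking for.
            verb = child[0]
    return verb, None


# Hand-built example; the determiner stays outside the NP because the
# grammar above only groups nouns and adjectives.
SENTENCE = [
    Chunk("NP", [("Wilbur", "NNP")]),
    ("kicked", "VBD"),
    ("the", "DT"),
    Chunk("NP", [("ball", "NN")]),
]

print(first_np_after_verb(SENTENCE))  # ('kicked', [('ball', 'NN')])
```

With NLTK itself, passing the tree returned by chunker.parse(...) should work the same way, since iterating over a Tree yields its children, but I have not tested that against your file.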

Sorry that the solution is not specific to your problem, but hopefully it is a helpful starting point. If you upload the file(s), I'll take another shot :)

What you are trying to do is to find a dependency, namely dobj. I'm not yet familiar enough with SyntaxNet/Parsey to tell you exactly how to go about extracting that dependency from its output, but I believe this answer might help you. In short, you can configure Parsey to use CoNLL syntax for output, parse it into whatever you find easy to traverse, then look for the ROOT dependency to find the verb and the *obj dependencies to find its objects.
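As a minimal sketch of that idea, assuming the output uses the 10-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, ...), something like the following could work. The sample lines are written by hand for illustration and are not real Parsey output, and the function names (read_sentence, root_and_objects, and so on) are made up for this example.

```python
from collections import defaultdict

# Hand-written sample in the assumed CoNLL-X layout (not real parser output).
SAMPLE = "\n".join([
    "1\tWilbur\t_\tNOUN\tNNP\t_\t2\tnsubj\t_\t_",
    "2\tkicked\t_\tVERB\tVBD\t_\t0\tROOT\t_\t_",
    "3\tthe\t_\tDET\tDT\t_\t4\tdet\t_\t_",
    "4\tball\t_\tNOUN\tNN\t_\t2\tdobj\t_\t_",
])


def read_sentence(conll_text):
    """Map token id -> dict with the columns we care about."""
    tokens = {}
    for line in conll_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        tokens[int(cols[0])] = {
            "form": cols[1],
            "pos": cols[4],
            "head": int(cols[6]),
            "deprel": cols[7],
        }
    return tokens


def children_of(tokens):
    """Map head id -> list of dependent ids (head 0 is the artificial root)."""
    children = defaultdict(list)
    for idx, tok in tokens.items():
        children[tok["head"]].append(idx)
    return children


def subtree_text(tokens, children, idx):
    """Join the words of the subtree rooted at idx, in sentence order."""
    ids, stack = [], [idx]
    while stack:
        cur = stack.pop()
        ids.append(cur)
        stack.extend(children.get(cur, []))
    return " ".join(tokens[i]["form"] for i in sorted(ids))


def root_and_objects(conll_text):
    """Return (root word, list of object phrases attached to the root)."""
    tokens = read_sentence(conll_text)
    children = children_of(tokens)
    root = next(i for i, t in tokens.items() if t["head"] == 0)
    objects = [
        subtree_text(tokens, children, c)
        for c in children[root]
        if tokens[c]["deprel"].endswith("obj")  # dobj, iobj, ...
    ]
    return tokens[root]["form"], objects


print(root_and_objects(SAMPLE))  # ('kicked', ['the ball'])
```

It handles one sentence at a time; for a file with several sentences you would split on blank lines first.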

If you have parsed the raw text into the CoNLL format using any parser, you can follow these steps to traverse the dependents of a node that you are interested in:

  1. Build an adjacency matrix from the output CoNLL sentence.
  2. Look for the node you are interested in (the verb in your case) and extract its dependents (their indices) from the adjacency matrix.
  3. For each dependent, look up its dependency label in the 8th column of the CoNLL format.

PS: I can provide the code, but it would be better if you can code it yourself.
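For anyone who wants a starting point anyway, the three steps can be sketched roughly as follows. This assumes one sentence in the 10-column CoNLL-X layout with token IDs running 1..n, a HEAD in column 7 and the dependency label in column 8; the sample rows are hand-written for illustration, and the helper names build_adjacency and dependents are invented for this example.

```python
# Hand-written sample sentence (an assumption, not real parser output).
SAMPLE = "\n".join([
    "1\tWilbur\t_\tNOUN\tNNP\t_\t2\tnsubj\t_\t_",
    "2\tkicked\t_\tVERB\tVBD\t_\t0\tROOT\t_\t_",
    "3\tthe\t_\tDET\tDT\t_\t4\tdet\t_\t_",
    "4\tball\t_\tNOUN\tNN\t_\t2\tdobj\t_\t_",
])

# One list of columns per token.
rows = [line.split("\t") for line in SAMPLE.splitlines() if line.strip()]


def build_adjacency(rows):
    """Step 1: adj[h][d] is True when token d depends on token h.

    Index 0 stands for the artificial root, so the matrix is (n+1) x (n+1).
    """
    n = len(rows)
    adj = [[False] * (n + 1) for _ in range(n + 1)]
    for cols in rows:
        token_id, head_id = int(cols[0]), int(cols[6])
        adj[head_id][token_id] = True
    return adj


def dependents(rows, adj, head_id):
    """Steps 2 and 3: list (word, label) for every dependent of head_id.

    The label comes from the 8th column (index 7) of the dependent's row.
    """
    return [
        (rows[d - 1][1], rows[d - 1][7])
        for d in range(1, len(adj))
        if adj[head_id][d]
    ]


adj = build_adjacency(rows)

# Step 2: pick the node of interest, here the first token tagged as a verb.
verb_id = next(int(cols[0]) for cols in rows if cols[4].startswith("VB"))

print(dependents(rows, adj, verb_id))  # [('Wilbur', 'nsubj'), ('ball', 'dobj')]
```

A dense matrix is fine for single sentences; a dictionary of head id to dependent ids would do the same job with less memory.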
