SyntaxNet creating tree to root verb
I am new to Python and the world of NLP. The recent announcement of Google's SyntaxNet intrigued me. However, I am having a lot of trouble understanding the documentation around both SyntaxNet and related tools (NLTK, etc.).
My goal: given an input such as "Wilbur kicked the ball", I would like to extract the root verb (kicked) and the object it pertains to, "the ball".
I stumbled across "spacy.io", and this visualization seems to encapsulate what I am trying to accomplish: POS-tag a string and load it into some sort of tree structure, so that I can start at the root verb and traverse the sentence.
I played around with syntaxnet/demo.sh and, as suggested in this thread, commented out the last couple of lines to get CoNLL output.
I then loaded this output into a Python script (kludged together myself, probably not correct):
import nltk
from nltk.corpus import ConllCorpusReader

# Map each CoNLL column to a type the reader understands;
# unused columns are marked 'ignore'.
columntypes = ['ignore', 'words', 'ignore', 'ignore', 'pos']
corp = ConllCorpusReader('/Users/dgourlay/development/nlp', 'input.conll', columntypes)
I see that I have access to corp.tagged_words(), but no relationship between the words. Now I am stuck! How can I load this corpus into a tree-type structure?

Any help is much appreciated!
This may have been better as a comment, but I don't yet have the required reputation. I haven't used the ConllCorpusReader before (would you consider uploading the file you are loading to a gist and providing a link? It would be much easier to test), but I wrote a blog post which may help with the tree-parsing aspect: here.
In particular, you probably want to chunk each sentence. Chapter 7 of the NLTK book has more information on this, but this is the example from my blog:
# This grammar is described in the paper by S. N. Kim,
# T. Baldwin, and M.-Y. Kan.
# "Evaluating n-gram based evaluation metrics for automatic
# keyphrase extraction." Technical report, University of
# Melbourne, Melbourne 2010.
import nltk

grammar = r"""
  NBAR:
    {<NN.*|JJ>*<NN.*>}  # Nouns and adjectives, terminated with nouns

  NP:
    {<NBAR>}
    {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(postoks)  # postoks: a list of (word, POS-tag) pairs
Note: You could also use a Context-Free Grammar (covered in Chapter 8). Each chunked (or parsed) sentence (or, in this example, noun phrase, according to the grammar above) will be a subtree. To access these subtrees, we can use this function:
def leaves(tree):
    """Finds NP (noun phrase) leaf nodes of a chunk tree."""
    # NLTK 3 renamed Tree.node to Tree.label(); on NLTK 2 use t.node instead.
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        yield subtree.leaves()
Each of the yielded objects will be a list of word–tag pairs. From there you can find the verb.
Next, you could play with the grammar above or the parser. Verbs split noun phrases (see this diagram in Chapter 7), so you can probably just access the first NP after a VBD.
Sorry for the solution not being specific to your problem, but hopefully it is a helpful starting point. If you upload the file(s) I'll take another shot :)
What you are trying to do is to find a dependency, namely dobj. I'm not yet familiar enough with SyntaxNet/Parsey to tell you exactly how to extract that dependency from its output, but I believe this answer might help you. In short, you can configure Parsey to use CoNLL syntax for output, parse it into whatever you find easy to traverse, then look for the ROOT dependency to find the verb and *obj dependencies to find its objects.
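Those steps might look roughly like this. The CoNLL lines below are hand-written for illustration (not actual Parsey output), and the column indices assume the standard CoNLL layout where the seventh column is the head id and the eighth is the relation:

```python
# Hand-written sample in CoNLL-style columns, standing in for real
# Parsey output: ID, FORM, LEMMA, CPOS, POS, FEATS, HEAD, DEPREL.
conll = """\
1\tWilbur\t_\tNOUN\tNNP\t_\t2\tnsubj
2\tkicked\t_\tVERB\tVBD\t_\t0\tROOT
3\tthe\t_\tDET\tDT\t_\t4\tdet
4\tball\t_\tNOUN\tNN\t_\t2\tdobj"""

rows = [line.split('\t') for line in conll.splitlines()]

# Column 7 (index 6) is the head id, column 8 (index 7) the relation.
root = next(r for r in rows if r[7] == 'ROOT')
print(root[1])  # kicked

# Any *obj relation whose head is the root verb is one of its objects.
objects = [r[1] for r in rows if r[7].endswith('obj') and r[6] == root[0]]
print(objects)  # ['ball']
```

Note that nsubj does not match the *obj check (it ends in "subj"), so only the direct object survives the filter.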
If you have parsed the raw text in the CoNLL format using whatever parser, you can follow these steps to traverse the dependents of a node that you are interested in:
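As a minimal sketch of such a traversal (the (id, form, head, deprel) rows are made-up sample data, not real parser output): invert the head pointers into a head → children map, then walk the subtree of any node recursively.

```python
from collections import defaultdict

# Illustrative (id, form, head, deprel) rows as they might come out of
# a CoNLL parse; head 0 marks the root.
rows = [
    (1, 'Wilbur', 2, 'nsubj'),
    (2, 'kicked', 0, 'ROOT'),
    (3, 'the', 4, 'det'),
    (4, 'ball', 2, 'dobj'),
]

# Invert head pointers into a head -> children map.
children = defaultdict(list)
for tok_id, form, head, rel in rows:
    children[head].append(tok_id)

def subtree(tok_id):
    """Yield tok_id and all of its transitive dependents."""
    yield tok_id
    for child in children[tok_id]:
        yield from subtree(child)

forms = {tok_id: form for tok_id, form, _, _ in rows}
# Collect the full phrase headed by token 4 ("ball"):
print(' '.join(forms[i] for i in sorted(subtree(4))))  # the ball
```

Sorting the collected ids restores surface word order, so the subtree of the object token yields the whole phrase "the ball".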
PS: I can provide the code, but it would be better if you can code it yourself.