
How do I do dependency parsing in NLTK?

Going through the NLTK book, it's not clear how to generate a dependency tree from a given sentence.

The relevant section of the book, the sub-chapter on dependency grammar, gives an example figure, but it doesn't show how to parse a sentence to come up with those relationships. Or maybe I'm missing something fundamental in NLP?

EDIT: I want something similar to what the Stanford parser does: given the sentence "I shot an elephant in my sleep", it should return something like:

nsubj(shot-2, I-1)
det(elephant-4, an-3)
dobj(shot-2, elephant-4)
prep(shot-2, in-5)
poss(sleep-7, my-6)
pobj(in-5, sleep-7)

We can use the Stanford Parser from NLTK.

Requirements

You need to download two things from their website:

  1. The Stanford CoreNLP parser.
  2. The language model for your desired language (e.g. the English language model).

Warning!

Make sure that your language model version matches your Stanford CoreNLP parser version!

The current CoreNLP version as of May 22, 2018 is 3.9.1.

After downloading the two files, extract the zip files anywhere you like.

Python Code

Next, load the model and use it through NLTK:

from nltk.parse.stanford import StanfordDependencyParser

path_to_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser.jar'
path_to_models_jar = 'path_to/stanford-parser-full-2014-08-27/stanford-parser-3.4.1-models.jar'

dependency_parser = StanfordDependencyParser(path_to_jar=path_to_jar, path_to_models_jar=path_to_models_jar)

result = dependency_parser.raw_parse('I shot an elephant in my sleep')
dep = next(result)  # raw_parse yields DependencyGraph objects; result.next() only works on Python 2

list(dep.triples())

Output

The output of the last line is:

[((u'shot', u'VBD'), u'nsubj', (u'I', u'PRP')),
 ((u'shot', u'VBD'), u'dobj', (u'elephant', u'NN')),
 ((u'elephant', u'NN'), u'det', (u'an', u'DT')),
 ((u'shot', u'VBD'), u'prep', (u'in', u'IN')),
 ((u'in', u'IN'), u'pobj', (u'sleep', u'NN')),
 ((u'sleep', u'NN'), u'poss', (u'my', u'PRP$'))]

I think this is what you want.
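
If you need more than the triples, the object returned by raw_parse is an NLTK DependencyGraph, which can render other views as well. A minimal sketch continuing from the code above (tree() and to_conll() are standard DependencyGraph methods):

# 'dep' is the DependencyGraph obtained above
print(dep.tree())          # nested Tree, e.g. (shot I (elephant an) (in (sleep my)))
dep.tree().pretty_print()  # ASCII drawing of the same tree
print(dep.to_conll(4))     # one token per line: word, POS tag, head index, relation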

If you need better performance, then spaCy (https://spacy.io/) is the best choice. Usage is very simple:

import spacy

nlp = spacy.load('en')  # the 'en' shortcut works on older spaCy; newer versions use spacy.load('en_core_web_sm')
sents = nlp(u'A woman is walking through the door.')

You'll get a dependency tree as output, and you can very easily dig out any information you need, as sketched below. You can also define your own custom pipelines. See more on their website.

https://spacy.io/docs/usage/
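
For example, a minimal sketch of walking the parse, assuming an English model such as en_core_web_sm is installed (token.dep_, token.head and token.children are all part of spaCy's stable API):

import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'A woman is walking through the door.')

# every token carries its relation, its head, and its children
for token in doc:
    print(token.text, token.dep_, token.head.text, [c.text for c in token.children])

# e.g. pull out the subject of the root verb
root = [t for t in doc if t.dep_ == 'ROOT'][0]
print([t for t in root.children if t.dep_ == 'nsubj'])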

I think you could use a corpus-based dependency parser instead of the grammar-based one NLTK provides.

Doing corpus-based dependency parsing on even a small amount of text in Python is not ideal performance-wise. So NLTK does provide a wrapper for MaltParser, a corpus-based dependency parser.
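
A minimal sketch of that wrapper, assuming Java is installed and you have downloaded both the MaltParser distribution and a pre-trained model such as engmalt.linear-1.7.mco (both paths below are placeholders):

from nltk.parse.malt import MaltParser

# placeholders: point these at your MaltParser directory and .mco model file
mp = MaltParser('/path/to/maltparser-1.9.2', '/path/to/engmalt.linear-1.7.mco')

graph = mp.parse_one('I shot an elephant in my sleep .'.split())
print(graph.tree())           # constituency-style view of the dependency graph
print(list(graph.triples()))  # (head, relation, dependent) triples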

You might find this other question about RDF representation of sentences relevant.

If you want to be serious about dependency parsing, don't use NLTK: all of its algorithms are dated and slow. Try something like spaCy instead: https://spacy.io/

To use the Stanford Parser from NLTK:

1) Run the CoreNLP server at localhost
Download Stanford CoreNLP here (and also the model file for your language). The server can be started by running the following command (more details here):

# Run the server using all jars in the current directory (e.g., the CoreNLP home directory)
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 15000

or via the NLTK API (you need to configure the CORENLP_HOME environment variable first):

import os
import corenlp  # the stanford-corenlp Python wrapper

os.environ["CORENLP_HOME"] = "dir"  # path to the extracted CoreNLP directory
client = corenlp.CoreNLPClient()
# do something
client.stop()

2) Call the dependency parser from NLTK

>>> from nltk.parse.corenlp import CoreNLPDependencyParser
>>> dep_parser = CoreNLPDependencyParser(url='http://localhost:9000')
>>> parse, = dep_parser.raw_parse(
...     'The quick brown fox jumps over the lazy dog.'
... )
>>> print(parse.to_conll(4))  
The     DT      4       det
quick   JJ      4       amod
brown   JJ      4       amod
fox     NN      5       nsubj
jumps   VBZ     0       ROOT
over    IN      9       case
the     DT      9       det
lazy    JJ      9       amod
dog     NN      5       nmod
.       .       5       punct
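
If you want relation triples like the ones in the question rather than CoNLL output, the same parse object (an NLTK DependencyGraph) also exposes triples():

>>> for governor, relation, dependent in parse.triples():
...     print(governor, relation, dependent)  # e.g. ('jumps', 'VBZ') nsubj ('fox', 'NN')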

See the detailed documentation here, and also this related question: NLTK CoreNLPDependencyParser: Failed to establish connection.

From the Stanford Parser documentation: "the dependencies can be obtained using our software [...] on phrase-structure trees using the EnglishGrammaticalStructure class available in the parser package." http://nlp.stanford.edu/software/stanford-dependencies.shtml

The dependencies manual also mentions: "Or our conversion tool can convert the output of other constituency parsers to the Stanford Dependencies representation." http://nlp.stanford.edu/software/dependencies_manual.pdf
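
A sketch of driving that conversion tool from Python via subprocess. The class name comes from the documentation quoted above, but treat the exact flags and file names below as assumptions to verify against the manual:

import subprocess

# 'trees.txt' is a hypothetical file of phrase-structure trees, one per line
subprocess.run([
    'java', '-mx200m', '-cp', 'stanford-parser.jar',
    'edu.stanford.nlp.trees.EnglishGrammaticalStructure',
    '-treeFile', 'trees.txt',  # input constituency trees
    '-basic',                  # emit basic Stanford Dependencies
])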

Neither functionality seems to be implemented in NLTK currently.

A little late to the party, but I wanted to add some example code with SpaCy that gets you your desired output:

import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("I shot an elephant in my sleep")
for token in doc:
    # relation(head-position, dependent-position), using 1-based token positions
    print(f"{token.dep_}({token.head.text}-{token.head.i + 1}, {token.text}-{token.i + 1})")

And here's the output, very similar to your desired output:

nsubj(shot-2, I-1)
ROOT(shot-2, shot-2)
det(elephant-4, an-3)
dobj(shot-2, elephant-4)
prep(shot-2, in-5)
poss(sleep-7, my-6)
pobj(in-5, sleep-7)

Hope that helps!
