
How do I extract noun phrases in Danish using StanfordNLP in python?

I have so far used the stanfordnlp library in python, and I have tokenized and POS-tagged a dataframe of text. I would now like to try to extract noun phrases. I have tried two different things, and I am having problems with both:

  1. From what I can see, the stanfordnlp python library doesn't seem to offer NP chunking out of the box; at least I haven't been able to find a way to do it. I have tried making a new dataframe of all words in order with their POS tags, and then checking if nouns are repeated. However, this is very crude and quite complicated for me (a minimal sketch of this idea appears after this list).

  2. I have been able to do it with English text using nltk, so I have also tried to use the Stanford CoreNLP API in NLTK. My problem in this regard is that I need a Danish model when setting up CoreNLP with Maven (which I am very inexperienced with). For problem 1 of this text, I have been using the Danish model found here. This doesn't seem to be the kind of model I am asked to find; again, I don't exactly know what I am doing, so apologies if I am misunderstanding something here.
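For reference, a minimal sketch of the crude approach from point 1, assuming the Danish stanfordnlp models are installed: it groups maximal runs of determiner/adjective/noun tokens into rough noun phrases. The POS whitelist and the Danish example sentence are my own assumptions, not part of the question.

import stanfordnlp

# One-time download of the Danish UD models:
# stanfordnlp.download('da')
nlp = stanfordnlp.Pipeline(lang='da', processors='tokenize,pos')

def crude_noun_chunks(text):
    """Group maximal runs of determiner/adjective/noun tokens into rough NPs."""
    chunks = []
    for sentence in nlp(text).sentences:
        run = []
        for word in sentence.words:
            if word.upos in ('DET', 'NUM', 'ADJ', 'NOUN', 'PROPN'):
                run.append(word)
            else:
                if any(w.upos in ('NOUN', 'PROPN') for w in run):
                    chunks.append(' '.join(w.text for w in run))
                run = []
        if any(w.upos in ('NOUN', 'PROPN') for w in run):
            chunks.append(' '.join(w.text for w in run))
    return chunks

# Danish for "I bought the book because I read good reviews about it."
print(crude_noun_chunks('Jeg købte bogen fordi jeg læste gode anmeldelser om den.'))
# e.g. ['bogen', 'gode anmeldelser']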

My questions then are (1) whether it is in fact possible to do chunking of NPs in stanfordnlp in python, (2) whether I can somehow pass the POS-tagged, tokenized, and lemmatized words from stanfordnlp to NLTK and do the chunking there, or (3) whether it is possible to set up CoreNLP in Danish and then use the CoreNLP API with NLTK.

Thank you, and apologies for my lack of clarity here.

The way that you can extract chunks from CoreNLP is by using the output of the constituency parser. As far as I know, there is no method in CoreNLP that can directly give you a list of chunks; however, you can parse the output of the constituency parser (the actual string) and list the chunks based on your needs. For example, for an input sentence like "I bought the book because I read good reviews about it.", the output of your method would be something like:

<class 'list'>: 
[['NP', 'I'], 
['NP', 'the book'], 
['NP', 'I'], 
['NP', 'good reviews'],
['NP', 'it'], 
['SBAR', 'because I read good reviews about it'], 
['VP', 'bought the book because I read good reviews about it'], 
['VP', 'read good reviews about it']]

The output above is from a method I've written myself; I only listed NPs, VPs, and SBARs here, but I haven't published the method yet since I need to further test and debug it.
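For illustration, here is a hedged sketch of how such a method could work, using NLTK's Tree class to walk a bracketed parse string and collect subtrees by label. The hard-written parse string is my own example of Penn-style output, not actual CoreNLP output, and the pre-order subtree ordering may differ slightly from the listing above.

from nltk import Tree

def extract_chunks(parse_str, labels=('NP', 'VP', 'SBAR')):
    """List [label, surface text] pairs for every subtree with a wanted label."""
    tree = Tree.fromstring(parse_str)
    return [[st.label(), ' '.join(st.leaves())]
            for st in tree.subtrees()
            if st.label() in labels]

# Hand-written parse of the example sentence (illustrative, not real parser output).
parse = ('(ROOT (S (NP (PRP I)) (VP (VBD bought) (NP (DT the) (NN book)) '
         '(SBAR (IN because) (S (NP (PRP I)) (VP (VBD read) '
         '(NP (JJ good) (NNS reviews)) (PP (IN about) (NP (PRP it))))))) (. .)))')
print(extract_chunks(parse))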

And if you only need the noun phrases, you may also want to look at spaCy and the solution here, which is pretty fast. Everything I mentioned is mainly regarding your first question and partly your second question, and I do not know whether these solutions apply to Danish as well or not.
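Assuming the linked solution refers to spaCy's built-in doc.noun_chunks iterator, a minimal English example would look like the following; whether a noun-chunk iterator exists for a given Danish model is a separate question.

import spacy

# Assumes the small English model is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

doc = nlp('I bought the book because I read good reviews about it.')
print([chunk.text for chunk in doc.noun_chunks])
# e.g. ['I', 'the book', 'I', 'good reviews', 'it']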

Some helpful info:

1.) To the best of my knowledge, Stanford CoreNLP (Java) has no support for Danish. We don't have Danish support, and I am unaware of a third party that has models for Danish. So neither the Java code nor the server would be of much help. Though it is certainly possible someone somewhere has some Danish models; I'll try researching on Google a little more.

2.) We do have Danish support for tokenization, part-of-speech tagging, lemmatization, and dependency parsing in the StanfordNLP (Python) codebase. At this time we don't have any noun-phrase-identifying software. We don't produce a constituency parse, so we can't just find an NP in a parse tree; it's a dependency parse. I would imagine there are decent techniques for extracting noun phrases based off of dependency parses or based off of chunking parts of speech. We can work on adding some functionality to help with this, though such a technique might not be perfect to start out with. But the spirit of UD 2.0 is to handle all languages, so this seems like a perfect case for writing generic noun-phrase extraction rules over UD 2.0 parses that would then work on all 70+ languages we support in the Python package.
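To make that idea concrete, here is a rough, untested sketch of what such generic rules could look like on top of stanfordnlp's Danish dependency parse: each NOUN/PROPN head is expanded with its NP-internal dependents. The set of relations is a hand-picked assumption, not an official rule set, and real UD-based extraction would need more care (e.g. nested phrases, non-contiguous dependents).

import stanfordnlp

# stanfordnlp.download('da')  # one-time model download
nlp = stanfordnlp.Pipeline(lang='da')  # tokenize, pos, lemma, depparse

# Hand-picked UD 2.0 relations that typically sit inside a noun phrase.
NP_RELATIONS = {'det', 'amod', 'nummod', 'flat', 'compound'}

def noun_phrases(text):
    """Each NOUN/PROPN head plus its NP-internal dependents, as a contiguous span."""
    phrases = []
    for sentence in nlp(text).sentences:
        words = sentence.words
        for head in words:
            if head.upos not in ('NOUN', 'PROPN'):
                continue
            span = [int(head.index)]  # word indices are 1-based
            span += [int(w.index) for w in words
                     if w.governor == int(head.index)
                     and w.dependency_relation in NP_RELATIONS]
            lo, hi = min(span), max(span)
            phrases.append(' '.join(w.text for w in words[lo - 1:hi]))
    return phrases

print(noun_phrases('Jeg købte bogen fordi jeg læste gode anmeldelser om den.'))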
