
How do I extract noun phrases in Danish using StanfordNLP in Python?

I have so far used the stanfordnlp library in Python, and I have tokenized and POS-tagged a dataframe of text. I would now like to try to extract noun phrases. I have tried two different things, and I am having problems with both:

  1. From what I can see, the stanfordnlp Python library doesn't offer NP chunking out of the box; at least I haven't been able to find a way to do it. I have tried making a new dataframe of all the words in order with their POS tags, and then checking whether nouns are repeated. However, this is very crude and quite complicated for me.

  2. I have been able to do it with English text using NLTK, so I have also tried to use the Stanford CoreNLP API in NLTK. My problem in this regard is that I need a Danish model when setting CoreNLP up with Maven (which I am very inexperienced with). For point 1 above, I have been using the Danish model found here. This doesn't seem to be the kind of model I am asked to find; again, I don't exactly know what I am doing, so apologies if I am misunderstanding something here.

My questions, then, are (1) whether it is in fact possible to do NP chunking with stanfordnlp in Python, (2) whether I can somehow pass the POS-tagged + tokenized + lemmatized words from stanfordnlp to NLTK and do the chunking there, or (3) whether it is possible to set up CoreNLP in Danish and then use the CoreNLP API with NLTK.
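To illustrate what I mean in (2), here is a rough, untested sketch of feeding (word, tag) pairs into NLTK's RegexpParser; the grammar and the hand-tagged Danish example are just my guesses:

import nltk

# Rough sketch for question (2): chunk over UD-style POS tags with NLTK.
# The grammar and the hand-tagged Danish example are guesses on my part.
grammar = "NP: {<DET>?<ADJ>*<NOUN|PROPN>+}"
chunker = nltk.RegexpParser(grammar)

# (word, upos) pairs as stanfordnlp would produce them for "Jeg købte den gode bog"
tagged = [("Jeg", "PRON"), ("købte", "VERB"), ("den", "DET"),
          ("gode", "ADJ"), ("bog", "NOUN")]

tree = chunker.parse(tagged)
for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))  # -> "den gode bog"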

Thank you, and apologies for my lack of clarity here.

The way you can extract chunks from CoreNLP is by using the output of the constituency parser. As far as I know, there is no method in CoreNLP that directly gives you a list of chunks; however, you can parse the output of the constituency parser, the actual bracketed string, and list the chunks based on your needs. For example, for an input sentence like "I bought the book because I read good reviews about it.", the output of your method would be something like:

<class 'list'>: 
[['NP', 'I'], 
['NP', 'the book'], 
['NP', 'I'], 
['NP', 'good reviews'],
['NP', 'it'], 
['SBAR', 'because I read good reviews about it'], 
['VP', 'bought the book because I read good reviews about it'], 
['VP', 'read good reviews about it']]

The output above is from a method I've written myself; I only listed NPs, VPs, and SBARs here. I haven't published the method yet, since I need to test and debug it further.
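A minimal sketch of the same idea (not my unpublished method, just an illustration): if you already have the bracketed parse string from CoreNLP, nltk.tree.Tree can walk it and collect the labels you care about:

from nltk.tree import Tree

def extract_chunks(parse_str, labels=("NP", "VP", "SBAR")):
    """Collect [label, text] pairs for the requested constituent labels."""
    tree = Tree.fromstring(parse_str)
    return [[t.label(), " ".join(t.leaves())]
            for t in tree.subtrees(filter=lambda t: t.label() in labels)]

# e.g. with a hand-written parse string (you would get one from CoreNLP)
print(extract_chunks("(ROOT (S (NP (PRP I)) (VP (VBD bought) (NP (DT the) (NN book)))))"))
# -> [['NP', 'I'], ['VP', 'bought the book'], ['NP', 'the book']]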

And if you only need the noun phrases, you may also want to look at spaCy and the solution here, which is pretty fast. Everything I mentioned is mainly regarding your first question and partly your second question; I do not know whether these solutions apply to Danish as well.
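For reference, spaCy's noun_chunks iterator looks like this with the English model (whether a given Danish pipeline exposes noun_chunks is something you would need to check):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed
doc = nlp("I bought the book because I read good reviews about it.")
for chunk in doc.noun_chunks:
    print(chunk.text)  # I / the book / I / good reviews / it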

Some helpful info:

1.) To the best of my knowledge, Stanford CoreNLP (Java) has no support for Danish. We don't have Danish support, and I am unaware of a third party that has models for Danish, so neither the Java code nor the server would be of much help. Though it is certainly possible someone somewhere has some Danish models; I'll try researching on Google a little more.

2.) We do have Danish support for tokenization, part-of-speech tagging, lemmatization, and dependency parsing in the StanfordNLP (Python) codebase. At this time we don't have any noun-phrase-identifying software. We don't produce a constituency parse, so you can't just find an NP in a parse tree; it's a dependency parse. I would imagine there are decent techniques for extracting noun phrases based on dependency parses or on chunking over part-of-speech tags, and we can work on adding some functionality to help with this, though such a technique might not be perfect to start out with. The spirit of UD 2.0 is to handle all languages, so this seems like a perfect case for writing generic noun-phrase extraction rules over UD 2.0 parses that would then work on all 70+ languages we support in the Python package.
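As a rough illustration of such generic rules (my own sketch, not shipped functionality): the relation whitelist below is an assumption, and the attribute names are those of the stanfordnlp 0.2.x Python package (later renamed in stanza):

import stanfordnlp

# NP-like dependency relations to pull in around a nominal head (an assumption)
NP_DEPS = {"det", "amod", "nummod", "compound", "flat"}

def noun_phrases(sentence):
    words = sentence.words  # already in surface order
    phrases = []
    for head in words:
        if head.upos not in ("NOUN", "PROPN"):
            continue
        # keep the head plus its direct dependents bearing NP-like relations
        ids = {int(head.index)}
        ids |= {int(w.index) for w in words
                if int(w.governor) == int(head.index)
                and w.dependency_relation in NP_DEPS}
        phrases.append(" ".join(w.text for w in words if int(w.index) in ids))
    return phrases

# stanfordnlp.download('da')  # one-time download of the Danish models
nlp = stanfordnlp.Pipeline(lang="da")
doc = nlp("Jeg købte den gode bog.")
for sent in doc.sentences:
    print(noun_phrases(sent))  # e.g. ['den gode bog']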
