
How to get noun phrases from a list of sentences using SpaCy

I have a list of sentences and need to find the noun phrases for each sentence using SpaCy. Currently, my code appends all noun phrases from all of the sentences into a single flat list. How can I get the noun phrases for each sentence and print them as a list of lists?

Say we have a list containing two sentences -

A = ["I am a boy", "I am a girl"]

import spacy
nlp = spacy.load("en_core_web_sm")

A_np = []
for x in A:
    doc = nlp(x)
    for np in doc.noun_chunks:
        A_np.append(np.text)
A_np

I am expecting to get something like this:

[['I','boy'],['I','girl']]

You need to make two changes:

1/ noun_chunks are spans, not tokens. Hence it is better to iterate over the individual tokens of a noun chunk.

2/ You need an intermediate list to store the noun chunks of a single sentence.

Revised code, which you can adjust to your requirements:

>>> import spacy
>>> A = ["I am a boy", "I am a girl"]
>>> nlp = spacy.load('en_core_web_sm')
>>> A_np = []
>>> for x in A:
...     doc = nlp(x)
...     sent_nps = []
...     for np in doc.noun_chunks:
...             sent_nps.extend([token.text for token in np])
...     A_np.append(sent_nps)
...
>>> A_np
[['I', 'a', 'boy'], ['I', 'a', 'girl']]
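Note that this output still contains the determiner 'a', while the question expected [['I','boy'],['I','girl']]. One way to get there is a small post-processing step; the stop set below is a hand-picked stand-in for checking spaCy's token.is_stop (or token.pos_ != 'DET') inside the loop:

```python
# Drop determiners from each sentence's token list to match the
# expected output. STOP is a toy stand-in for spaCy's token.is_stop.
STOP = {"a", "an", "the"}

A_np = [['I', 'a', 'boy'], ['I', 'a', 'girl']]
filtered = [[tok for tok in sent if tok.lower() not in STOP] for sent in A_np]
print(filtered)  # [['I', 'boy'], ['I', 'girl']]
```

Equivalently, you could filter while iterating: `[token.text for token in np if not token.is_stop]`.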

I figured it out by appending an empty list to A_np before the second loop and inserting the doc chunks into that last element. The two loops keep extracting noun phrases and inserting them into the sub-list for the current sentence.

A = ["I am a boy", "I am a girl"]

A_np = []
for x in A:
    doc = nlp(x)
    A_np.append([])
    for np in doc.noun_chunks:
        A_np[-1].append(np.text)
A_np
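The "append an empty sub-list, then fill its last element" pattern is independent of spaCy. Here is a minimal sketch of the same grouping idea, with a plain str.split() standing in for the nlp() call:

```python
# Group per-sentence results by appending an empty sub-list first,
# then filling results[-1]. str.split() stands in for nlp().
A = ["I am a boy", "I am a girl"]

results = []
for sentence in A:
    results.append([])            # one sub-list per sentence
    for word in sentence.split():
        results[-1].append(word)

print(results)
# [['I', 'am', 'a', 'boy'], ['I', 'am', 'a', 'girl']]
```

With the real pipeline, the inner loop iterates over doc.noun_chunks instead of split words, so each sub-list holds one sentence's noun phrases.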

After creating the lists of words from the sentences, removing the noise and stop words, and converting everything to the same case, you will have a set of words left in the data.

Then you can load the model

nlp = spacy.load('en', disable=['parser', 'ner'])

or, using the full model name,

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

then you can define a function to filter out the noun words like:

def filter_nouns(texts, tags=['NOUN']):
    output = []
    for x in texts:
        doc = nlp(" ".join(x))
        output.append([token.lemma_ for token in doc if token.pos_ in tags])
    return output

then you can apply the defined function on the cleaned data
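Note that filter_nouns expects a list of token lists, since it re-joins each inner list with " " before calling nlp(). To make the filtering logic visible without loading a model, here is a sketch with the same shape, where a hard-coded POS lookup (invented for illustration) stands in for the spaCy pipeline:

```python
# Same structure as filter_nouns, but with a toy POS lookup standing
# in for the spaCy pipeline. TOY_POS is invented for this example.
TOY_POS = {"i": "PRON", "am": "AUX", "a": "DET", "boy": "NOUN", "girl": "NOUN"}

def filter_nouns_toy(texts, tags=('NOUN',)):
    output = []
    for x in texts:  # x is one sentence's list of cleaned tokens
        output.append([tok for tok in x if TOY_POS.get(tok.lower()) in tags])
    return output

print(filter_nouns_toy([["i", "am", "a", "boy"], ["i", "am", "a", "girl"]]))
# [['boy'], ['girl']]
```

With the real filter_nouns, the pipeline's tagger supplies token.pos_ instead of the lookup table, and token.lemma_ normalizes each kept word.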

I hope this proves useful.
