How to get noun phrases from a list of sentences using SpaCy

I have a list of sentences and need to find the noun phrases for each sentence using SpaCy. Currently, the output only appends all noun phrases from all of the sentences into one flat list. How can I get the noun phrases for each sentence and print them as a list of lists?

Say we have a list with two sentences:

A = ["I am a boy", "I am a girl"]

import spacy

nlp = spacy.load("en_core_web_sm")

A_np = []
for x in A:
    doc = nlp(x)
    for np in doc.noun_chunks:
        A_np.append(np.text)  # every noun phrase ends up in one flat list
A_np

I am expecting to get something like this:

[['I','boy'],['I','girl']]

You need to make two improvements:

1/ noun_chunks are spans, not tokens. Hence it is better to iterate over the individual tokens of each noun chunk.

2/ You need an intermediate list to store the noun chunks of a single sentence.

Example code, which you can adjust as per your requirements:

>>> import spacy
>>> A = ["I am a boy", "I am a girl"]
>>> nlp = spacy.load('en')
>>> A_np = []
>>> for x in A:
...     doc = nlp(x)
...     sent_nps = []
...     for np in doc.noun_chunks:
...             sent_nps.extend([token.text for token in np])
...     A_np.append(sent_nps)
...
>>> A_np
[['I', 'a', 'boy'], ['I', 'a', 'girl']]
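If you would rather keep each noun chunk as one phrase instead of splitting it into tokens, here is a minimal sketch (assuming the en_core_web_sm model is installed; spacy.load('en') is a deprecated shorthand in recent spaCy versions):

import spacy

nlp = spacy.load("en_core_web_sm")

A = ["I am a boy", "I am a girl"]

# nlp.pipe streams the texts through the pipeline, which is faster for long lists
A_np = [[np.text for np in doc.noun_chunks] for doc in nlp.pipe(A)]
print(A_np)  # [['I', 'a boy'], ['I', 'a girl']]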

I figured it out by appending an empty list before the second loop and inserting each noun chunk into the last element of the outer list. The two loops keep parsing noun phrases and inserting the processed noun phrases, sentence by sentence.

A = ["I am a boy", "I am a girl"]

A_np = []
for x in A:
    doc = nlp(x)
    A_np.append([])  # start a new inner list for this sentence
    for np in doc.noun_chunks:
        A_np[-1].append(np.text)
A_np
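With the two example sentences this yields [['I', 'a boy'], ['I', 'a girl']], since np.text keeps each whole chunk, determiner included.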

After building the word lists from the sentences, removing noise and stop words, and lowercasing everything, you will have a set of words left in the data.

Then you can load the pipeline:

nlp = spacy.load('en', disable=['parser', 'ner'])

or:

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
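Disabling the parser and ner components speeds up loading and processing when you only need part-of-speech tags and lemmas; note that doc.noun_chunks depends on the parser, so keep it enabled if you still need noun chunks.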

Then you can define a function that keeps only the nouns, like:

def filter_nouns(texts, tags=['NOUN']):
    output = []
    for x in texts:
        doc = nlp(" ".join(x))
        output.append([token.lemma_ for token in doc if token.pos_ in tags])
    return output

Then you can apply the defined function to the cleaned data, as in the sketch below.
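For example, a minimal sketch (the cleaned_data name is hypothetical; each element is a list of already-cleaned tokens, which is what filter_nouns expects):

# hypothetical cleaned data: one list of cleaned tokens per sentence
cleaned_data = [["i", "am", "a", "boy"], ["i", "am", "a", "girl"]]

print(filter_nouns(cleaned_data))  # e.g. [['boy'], ['girl']]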

I hope this proves useful.
