How to select only the first entity extracted from spaCy entities?

I am trying to use the following code to extract entities from text stored in a DataFrame.

for i in df['Text'].to_list():
    doc = nlp(i)
    for entity in doc.ents:
        if entity.label_ == 'GPE':
            ...  # this is where the first GPE should be stored for the row

I need to store the text of the first GPE alongside its corresponding row of text. For instance, suppose the following is the text at index 0 in column df['Text']:

Match between USA and Canada was postponed

Then I need only the first location (USA) in another column, such as df['Place'], at the index corresponding to that text, which is 0. df['Place'] does not exist in the DataFrame yet, so it will be created when the value is assigned. I have tried the following code, but it fills the whole column with the very first value it can find.

for i in df['Text'].to_list():

    doc = nlp(i)
    for entity in doc.ents:
        if entity.label_ == 'GPE':
            df['Place'] = (entity.text)

I have also tried appending the text to a list with e_list.append((entity.text)), but that appends every entity it can find in the text. Can someone help me store only the first entity at the corresponding index? Thank you.

You can get all the GPE entities for each entry by using Series.apply on the Text column:

df['Place'] = df['Text'].apply(lambda x: [entity.text for entity in nlp(x).ents if entity.label_ == 'GPE'])

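If you are only interested in the first entity from each entry, use the following. The `or ['']` part substitutes an empty string when a row has no GPE entity, so indexing with [0] does not raise an IndexError: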

df['Place'] = df['Text'].apply(lambda x: ([entity.text for entity in nlp(x).ents if entity.label_ == 'GPE'] or [''])[0])

Here is a test snippet:

import spacy
import pandas as pd

# Assumption: the small English model is installed; any pipeline with an NER component should work.
nlp = spacy.load('en_core_web_sm')
df = pd.DataFrame({'Text':['Match between USA and Canada was postponed', 'No ents']})
df['Text'].apply(lambda x: [entity.text for entity in nlp(x).ents if entity.label_ == 'GPE'])
# => 0    [USA, Canada]
#    1               []
#    Name: Text, dtype: object
df['Text'].apply(lambda x: ([entity.text for entity in nlp(x).ents if entity.label_ == 'GPE'] or [''])[0])
# => 0    USA
#    1       
#    Name: Text, dtype: object
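
As an alternative to building the full list and then taking element [0], the first match can be picked with next() and a default value. The sketch below is only an illustration: first_entity_text is a hypothetical helper name, and the SimpleNamespace objects are stand-ins that mimic just the .text and .label_ attributes of spaCy entities, so the snippet runs without a model.

```python
from types import SimpleNamespace


def first_entity_text(ents, label="GPE", default=""):
    """Return the text of the first entity with the given label, else `default`."""
    # next() stops at the first match, so later entities are never inspected.
    return next((ent.text for ent in ents if ent.label_ == label), default)


# Stand-in entities (not real spaCy objects); only .text and .label_ are used.
ents = [
    SimpleNamespace(text="Monday", label_="DATE"),
    SimpleNamespace(text="USA", label_="GPE"),
    SimpleNamespace(text="Canada", label_="GPE"),
]

print(first_entity_text(ents))  # USA
print(repr(first_entity_text([])))  # ''
```

With a loaded pipeline, this could be applied as df['Place'] = [first_entity_text(doc.ents) for doc in nlp.pipe(df['Text'])]. nlp.pipe processes the texts as a stream, which is typically faster than calling nlp once per row.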
