
Pandas: extract a substring from a column of strings

I have the following dataframes:

df1:
Name
0   AAa
1   BBB
2   ccc

df2:
Description
0   text AAa clinic text
1   text bbb hospital text

I want to add another column to df2 that extracts Name from Description, but only when the Name is followed by either 'clinic' or 'hospital'. So if Description is "text AAa text", I don't want AAa to be extracted.

I feel like this should be straightforward, but for some reason I am stuck and can't find a solution.

I have tried the following, but it returns None for every row of df2['Extracted Name']:

def df_matcher(x):
    for i in df1['Name']:
        if ((i.lower() + " clinic" in x.lower()) or (i.lower() + " hospital" in x.lower())):
            return i

df2['Extracted Name'] = df2['Description'].apply(df_matcher)

Thanks!

Let's use str.extract with a regex pattern:

pat = r'(?i)\b(%s)\s(?:clinic|hospital)' % '|'.join(df1.Name)
df2['col'] = df2['Description'].str.extract(pat)

              Description  col
0    text AAa clinic text  AAa
1  text bbb hospital text  bbb
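
For reference, here is a self-contained sketch of the same str.extract idea that can be run as-is. It makes a few assumptions that are not in the answer above: the names are passed through re.escape in case a name contains regex metacharacters, expand=False is used so the result is a Series, a third row ("text AAa text") is added to show the no-match case, and the optional 'Name' column that maps the match back to the spelling used in df1 is purely illustrative.

```python
import re
import pandas as pd

df1 = pd.DataFrame({"Name": ["AAa", "BBB", "ccc"]})
df2 = pd.DataFrame({"Description": ["text AAa clinic text",
                                    "text bbb hospital text",
                                    "text AAa text"]})

# Escape each name so regex metacharacters inside a name are matched literally.
names = "|".join(re.escape(n) for n in df1["Name"])
# (?i) makes the match case-insensitive; the name must be followed by
# whitespace and then the whole word "clinic" or "hospital".
pat = rf"(?i)\b({names})\s+(?:clinic|hospital)\b"

# expand=False returns a Series; rows without a match become NaN.
df2["col"] = df2["Description"].str.extract(pat, expand=False)

# Optional (assumption): map the matched text back to the spelling in df1.
canonical = {n.lower(): n for n in df1["Name"]}
df2["Name"] = df2["col"].str.lower().map(canonical)
```

The extracted text keeps the casing found in Description (for example 'bbb'), which is why the lookup through lowercased keys is needed if the df1 spelling is wanted instead.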

If the part you want to remove appears immediately before 'clinic' or 'hospital', you don't need the first dataframe and you can try this:

df2['split'] = df2.Description.str.split()
df2['index'] = df2['split'].apply(lambda x:x.index('clinic') if 'clinic' in x else x.index('hospital'))

df2['remove'] = df2.apply(lambda x:x['split'][x['index'] - 1], axis=1)
df2['Extracted Name'] = df2.apply(lambda x:x['Description'].replace(x['remove'], ''), axis=1)
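
Note that list.index raises ValueError when a description contains neither keyword, so the split-based approach above assumes every row has 'clinic' or 'hospital' in it. As a hedged alternative, the word just before the keyword can be captured with a regex, again without the first dataframe. This is only a sketch: the column name 'before_keyword' is hypothetical, and it assumes the name is a single word made of \w characters.

```python
import pandas as pd

df2 = pd.DataFrame({"Description": ["text AAa clinic text",
                                    "text bbb hospital text",
                                    "text AAa text"]})

# Capture the single word immediately before "clinic" or "hospital"
# (case-insensitive). Rows with neither keyword yield NaN instead of an error.
pat = r"(?i)\b(\w+)\s+(?:clinic|hospital)\b"
df2["before_keyword"] = df2["Description"].str.extract(pat, expand=False)
```

Rows without a keyword can then be filtered out with df2["before_keyword"].notna() if needed.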
