Pandas: extract a substring from a string column

Question

I have the following dataframes:

df1:
Name
0   AAa
1   BBB
2   ccc

df2:
Description
0   text AAa clinic text
1   text bbb hospital text

I want to add another column to df2 that extracts Name from Description, but only when the Name is followed by either 'clinic' or 'hospital'. So if Description is "text AAa text", I don't want AAa to be extracted.

I feel like this should be straightforward, but for some reason I am stuck and can't find a solution.

I have tried the following, but it returns None for every row of df2['Extracted Name']:

def df_matcher(x):
    for i in df1['Name']:
        if ((i.lower() + " clinic" in x.lower()) or (i.lower() + " hospital" in x.lower())):
            return i

df2['Extracted Name'] = df2['Description'].apply(df_matcher)

Thanks!

Answer 1

Let's use str.extract with a regex pattern:

pat = r'(?i)\b(%s)\s(?:clinic|hospital)' % '|'.join(df1.Name)
df2['col'] = df2['Description'].str.extract(pat)

              Description  col
0    text AAa clinic text  AAa
1  text bbb hospital text  bbb
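
For anyone who wants to try this end to end, here is a self-contained sketch of the same str.extract idea. It makes a few assumptions that are not in the answer above: the third Description row is a hypothetical example added to show the no-match case, each name is passed through re.escape in case a name contains regex metacharacters, `\s+` is used so more than one space is tolerated, and expand=False is passed so a Series comes back instead of a one-column DataFrame.

```python
import re
import pandas as pd

df1 = pd.DataFrame({"Name": ["AAa", "BBB", "ccc"]})
df2 = pd.DataFrame({"Description": [
    "text AAa clinic text",
    "text bbb hospital text",
    "text AAa text",  # hypothetical row: name is not followed by a keyword
]})

# Escape each name so regex metacharacters in a name are matched literally.
names = "|".join(re.escape(n) for n in df1["Name"])

# (?i) makes the whole pattern case-insensitive; only the name is captured.
pat = rf"(?i)\b({names})\s+(?:clinic|hospital)\b"

# With a single capture group, expand=False returns a Series.
df2["Extracted Name"] = df2["Description"].str.extract(pat, expand=False)
print(df2)
```

Note that the extracted value keeps the casing found in Description (bbb, not BBB), and rows without a match get a missing value.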

Answer 2

If the part you want to remove appears immediately before 'clinic' or 'hospital', you don't need the first dataframe and you can try this:

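# Tokenize each description on whitespace
df2['split'] = df2.Description.str.split()
# Position of the keyword; 'hospital' must be present whenever 'clinic' is not
df2['index'] = df2['split'].apply(lambda x: x.index('clinic') if 'clinic' in x else x.index('hospital'))

# The token immediately before the keyword
df2['remove'] = df2.apply(lambda x: x['split'][x['index'] - 1], axis=1)
# Description with that token removed
df2['Extracted Name'] = df2.apply(lambda x: x['Description'].replace(x['remove'], ''), axis=1)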
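
If the goal is to keep the word before the keyword rather than strip it out, the same split-based idea can be sketched as a small helper. This is only an illustration under stated assumptions: word_before_keyword is a hypothetical name, the third row is a made-up example for the no-keyword case, and the keyword comparison is done case-insensitively.

```python
import pandas as pd

df2 = pd.DataFrame({"Description": [
    "text AAa clinic text",
    "text bbb hospital text",
    "text AAa text",  # hypothetical row with neither keyword
]})

def word_before_keyword(description, keywords=("clinic", "hospital")):
    """Return the token immediately before the first keyword, or None."""
    tokens = description.split()
    for pos, token in enumerate(tokens):
        # Skip a keyword in first position: nothing precedes it.
        if token.lower() in keywords and pos > 0:
            return tokens[pos - 1]
    return None  # no keyword found, so nothing is extracted

df2["Extracted Name"] = df2["Description"].apply(word_before_keyword)
print(df2)
```

Unlike the version with x.index('hospital'), this does not raise ValueError on rows that contain neither keyword; those rows simply get a missing value.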

Answer 1
0 2023-01-17 04:35:27

Answer 2
0 2023-01-17 10:46:54
