简体   繁体   中英

Pandas extract substring from column of string

I have the following dataframes

df1:
Name
0   AAa
1   BBB
2   ccc

df2:
Description
0   text AAa clinic text
1   text bbb hospital text

I want to add another column to df2 that extracts Name from Description, but the Name has to be followed by either 'clinic' or 'hospital'. So if Description is "text AAa text", I don't want AAa to be extracted

I feel like this should be straightforward but for some reason I am stuck and can't find a solution

I have tried the following but it returns df2['Extracted Name'] all None

def df_matcher(x):
    for i in df1['Name']:
        if ((i.lower() + " clinic" in x.lower()) or (i.lower() + " hospital" in x.lower())):
            return i

df2['Extracted Name'] = df2['Description'].apply(df_matcher)

Thanks!

Let's use str.extract with a regex pattern:

pat = r'(?i)\b(%s)\s(?:clinic|hospital)' % '|'.join(df1.Name)
df2['col'] = df2['Description'].str.extract(pat)

              Description  col
0    text AAa clinic text  AAa
1  text bbb hospital text  bbb

If the part you want to remove appears exactly before clinic or hospital, you don't need the first dataframe and you can try this:

df2['split'] = df2.Description.str.split()
df2['index'] = df2['split'].apply(lambda x:x.index('clinic') if 'clinic' in x else x.index('hospital'))

df2['remove'] = df2.apply(lambda x:x['spilit'][x['index'] - 1], axis=1)
df2['Extracted Name'] = df2.apply(lambda x:x['Description'].replace(x['remove'], ''), axis=1)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM