
Pandas extract substring based on another column

I have two DataFrames. Here is the first:

import pandas as pd

df1 = {"columnA": ['apple,cherry', 'pineple,lemon', 'banana, pear', 'cherry, pear, lemon']}
df1 = pd.DataFrame(df1)

And the second:

df2={"columnB":['lemon','cherry']}
df2=pd.DataFrame(df2)

I have already filtered df1 down to the rows that contain at least one value from df2, using the code below:

words = [rf'\b{string}\b' for string in df2.columnB]
df1[df1['columnA'].str.contains('|'.join(words))]

which gives this result:

[image: enter image description here]

The next step is to remove all unwanted substrings from the result above, like this:

[image: enter image description here]

How can I achieve this?

I think you need a separate function that is applied to each cell of the DataFrame column:

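def keep_words(cell, df):
    # Accepted words, taken from the second DataFrame (built once per call)
    accepted = set(df.columnB)
    result = []
    for word in cell.split(','):
        # Compare without surrounding whitespace; the word is kept as written
        if word.strip() in accepted:
            result.append(word)
    return ','.join(result)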

words = [rf'\b{string}\b' for string in df2.columnB]
df1 = df1[df1['columnA'].str.contains('|'.join(words))]
df3 = df1.columnA.apply(lambda x: keep_words(x, df2))

Since this takes quite a few steps, I defined a separate function, keep_words. It takes the string from each cell and the DataFrame of accepted words, compares each comma-separated word in the string against the accepted words, and returns only the matching ones.
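To make the per-cell behaviour concrete, here is a minimal, self-contained sketch of the same idea. It is a variation rather than the exact function above: it takes a plain set of accepted words (the names accepted, cleaned and result are introduced here for illustration) and strips whitespace from the words it keeps, which is an assumption about the desired output format.

```python
import pandas as pd

df1 = pd.DataFrame({"columnA": ['apple,cherry', 'pineple,lemon',
                                'banana, pear', 'cherry, pear, lemon']})
df2 = pd.DataFrame({"columnB": ['lemon', 'cherry']})


def keep_words(cell, accepted):
    """Return the comma-separated words of `cell` that are in `accepted`."""
    # Strip each piece so that ' lemon' and 'lemon' are treated the same.
    pieces = (piece.strip() for piece in cell.split(','))
    return ','.join(piece for piece in pieces if piece in accepted)


accepted = set(df2['columnB'])

# Apply to every cell, then drop the rows where nothing matched.
cleaned = df1['columnA'].apply(lambda cell: keep_words(cell, accepted))
result = cleaned[cleaned != '']
```

With the sample data, keep_words('cherry, pear, lemon', accepted) should give 'cherry,lemon', and the row 'banana, pear' is dropped because none of its words are accepted.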

I am not sure how well this performs on larger DataFrames, though.
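If performance does become a concern, one possible alternative is to let pandas' string methods do the work instead of a Python-level function per cell. The sketch below is not from the answer above and has not been benchmarked: it reuses the word-boundary pattern from the question with Series.str.findall, then joins the matches back together. The names pattern, matches and filtered are illustrative, and re.escape is added on the assumption that the words in df2 may contain regex metacharacters.

```python
import re

import pandas as pd

df1 = pd.DataFrame({"columnA": ['apple,cherry', 'pineple,lemon',
                                'banana, pear', 'cherry, pear, lemon']})
df2 = pd.DataFrame({"columnB": ['lemon', 'cherry']})

# One alternation pattern with word boundaries, e.g. \b(?:lemon|cherry)\b
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in df2['columnB']) + r')\b'

# Each cell becomes the list of accepted words found in it, in order.
matches = df1['columnA'].str.findall(pattern)

# Keep rows with at least one match and replace the cell with the joined matches.
filtered = df1[matches.str.len() > 0].assign(columnA=matches.str.join(','))
```

This does the filtering and the clean-up in one pass over the column. Note that it matches on word boundaries rather than on exact comma-separated items, so the two approaches can differ for multi-word entries.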

