Pandas extract substring based on another column

Question

I have two DataFrames. Here is the first one:

import pandas as pd

df1 = {"columnA": ['apple,cherry', 'pineple,lemon', 'banana, pear', 'cherry, pear, lemon']}
df1 = pd.DataFrame(df1)

And here is the second one:

df2 = {"columnB": ['lemon', 'cherry']}
df2 = pd.DataFrame(df2)

I have already filtered df1 down to the rows that contain a value from df2, using the code below:

words = [rf'\b{string}\b' for string in df2.columnB]
df1[df1['columnA'].str.contains('|'.join(words))]

and I got below:

As the next step, I want to remove every unwanted substring from the result above, so that only the words that appear in df2 remain, like this:

How can I achieve this?

Answer 1

I think you need a separate function that is applied to each cell of the column:

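def keep_words(cell, df):
    # Build the set of accepted words once, not on every loop iteration.
    accepted = set(df.columnB)
    result = []
    for word in cell.split(','):
        word = word.strip()  # so that ' lemon' matches 'lemon'
        if word in accepted:
            result.append(word)
    return ','.join(result)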

words = [rf'\b{string}\b' for string in df2.columnB]
df1 = df1[df1['columnA'].str.contains('|'.join(words))]
df3 = df1.columnA.apply(lambda x: keep_words(x, df2))

Since it takes quite a few steps, I defined a separate function (keep_words). It takes the string inside each cell and the DataFrame of accepted words, compares each word in the string against the list of accepted words, and returns only the ones that match.

I am not sure how well this performs on larger DataFrames, though.
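If performance becomes a concern, one possible alternative is to let pandas' string methods do the matching instead of calling a Python function per row. The following is only a sketch under a few assumptions: the names pattern, matches and result are made up for illustration, the words in columnB are treated as literal text (hence re.escape), and the kept words are joined with a plain comma. I have not benchmarked it against the apply version.

```python
import re

import pandas as pd

df1 = pd.DataFrame(
    {"columnA": ['apple,cherry', 'pineple,lemon', 'banana, pear', 'cherry, pear, lemon']}
)
df2 = pd.DataFrame({"columnB": ['lemon', 'cherry']})

# One regex that matches any accepted word as a whole word.
# re.escape guards against regex metacharacters in columnB (an assumption
# that the words are meant literally).
pattern = r'\b(?:' + '|'.join(re.escape(w) for w in df2['columnB']) + r')\b'

# For every row, collect the accepted words in the order they appear.
matches = df1['columnA'].str.findall(pattern)

# Drop rows without any match, then join the kept words back into one string.
result = matches[matches.str.len() > 0].str.join(',')

print(result)
```

This filters and extracts in one pass, so the separate str.contains step is no longer needed. The original row index is preserved, which makes it easy to assign the result back to the filtered frame.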

1 answer

solution1
0 ACCEPTED 2020-08-28 20:33:43
