I have two DataFrames. Here is the first one:
import pandas as pd
df1 = {"columnA": ['apple,cherry', 'pineple,lemon', 'banana, pear', 'cherry, pear, lemon']}
df1 = pd.DataFrame(df1)
And the second one:
df2={"columnB":['lemon','cherry']}
df2=pd.DataFrame(df2)
I have already selected the rows of df1 that contain a value from df2, using the code below to filter:
words = [rf'\b{string}\b' for string in df2.columnB]
df1[df1['columnA'].str.contains('|'.join(words))]
This gives me the rows of df1 that contain 'lemon' or 'cherry'.
The next step is to remove all unwanted substrings from that result, so that each cell keeps only the values that appear in df2.
How can I achieve this?
I think you need a separate function that is applied to each cell of the column:
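def keep_words(cell, df):
    # Split the cell into its comma-separated words
    words = cell.split(',')
    result = []
    for word in words:
        # Keep the word only if it appears in the accepted column
        if word.strip() in list(df.columnB):
            result.append(word)
    return ','.join(result)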
words = [rf'\b{string}\b' for string in df2.columnB]
df1 = df1[df1['columnA'].str.contains('|'.join(words))]
df3 = df1.columnA.apply(lambda x: keep_words(x, df2))
Since it takes quite a few steps, define a separate function (keep_words) that takes the string inside each cell and the DataFrame holding the accepted words, compares each word in the string against the list of accepted words, and returns only the matching ones.
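To check what this produces on the sample data, here is a self-contained run. It is only a sketch: keep_words is condensed, and the set lookup (the `accepted` name) is my own variation rather than part of the code above. The third cell keeps its inner space because the words are stripped only for the comparison, not in the output.

```python
import pandas as pd

df1 = pd.DataFrame({"columnA": ['apple,cherry', 'pineple,lemon',
                                'banana, pear', 'cherry, pear, lemon']})
df2 = pd.DataFrame({"columnB": ['lemon', 'cherry']})

def keep_words(cell, df):
    # Build the lookup once per cell; a set makes membership tests cheap
    accepted = set(df.columnB)
    # Words are stripped only for the comparison, so original spacing is kept
    return ','.join(w for w in cell.split(',') if w.strip() in accepted)

# Same row filter as in the question
words = [rf'\b{string}\b' for string in df2.columnB]
filtered = df1[df1['columnA'].str.contains('|'.join(words))]

result = filtered.columnA.apply(lambda x: keep_words(x, df2))
print(result.tolist())
# ['cherry', 'lemon', 'cherry, lemon']
```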
I am not sure about the performance in bigger DataFrames though.
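If performance does become a concern, one possible alternative is to avoid the per-cell Python function and work on the split words directly. The sketch below is a different technique from keep_words: it assumes a pandas version that has Series.explode, and I have not benchmarked it. The names `tokens`, `kept` and `result` are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({"columnA": ['apple,cherry', 'pineple,lemon',
                                'banana, pear', 'cherry, pear, lemon']})
df2 = pd.DataFrame({"columnB": ['lemon', 'cherry']})

# One row per comma-separated word; the original row index is repeated
tokens = df1['columnA'].str.split(',').explode().str.strip()

# Keep only the words that appear in df2.columnB
kept = tokens[tokens.isin(df2['columnB'])]

# Join the surviving words back together per original row
result = kept.groupby(level=0).agg(','.join)
print(result.tolist())
# ['cherry', 'lemon', 'cherry,lemon']
```

Rows without any match (here 'banana, pear') drop out on their own, so the separate str.contains filter is not needed in this variant. Unlike keep_words, the words are stripped before joining, so the output contains no spaces after the commas.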