简体   繁体   中英

Using two dataframes how can I compare a lookup value as a substring in the column in another dataframe to create a new column if the match exists

I'm attempting to use two dataframes, one as a lookup table to find a substring match on the value in my datasets dataframes column. After I find the value, I'd like to create a new column with that value and iterate through the entire column and remove the matched substring from the initial column and loop through until there are no more matches.

The first problem I'm having is I'm not able to match or return the value of the match unless it's an exact string. The tricky part is sometimes the Ingredient_Name contains multiple words for a single ingredient.

This is a smaller sample of my code, the commented sections include the error or the problem with what I tried:

import pandas as pd

singleingredientdata = {
    'Ingredient_Name':['ACEBUTOLOL','ACETAMINOPHEN','ACETYLSALICYLIC ACID','CAFFEINE','COLISTIN','HYDROCORTISONE','NEOMYCIN','THONZONIUM BROMIDE','BROMIDE'],
'WordCount':[1,1,2,1,1,1,1,2,1],
'Num_Of_Ingredients':[1,1,1,1,1,1,1,1,1]
}

multiingredientdata = {
'Ingredient_Name':['ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE','ACEBUTOLOL ACETYLSALICYLIC ACID','COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE','BROMIDE'],
'WordCount':[4,3,5,1],
'Num_Of_Ingredients':[3,2,4,1]
}

df1 = pd.DataFrame(data=singleingredientdata)
df2 = pd.DataFrame(data=multiingredientdata)
ingredientcount = df2["Num_Of_Ingredients"]
max_value = ingredientcount.max()



df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])  ##Doesn't flag True unless it finds a single igredient exists in the string
##df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '') error in regards to requiring string not series TypeError: 'in <string>' requires string as left operand, not Series
#df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]  ## passing 4 items instead of a single pass being implied??
##boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())  TypeError: first argument must be string or compiled pattern
iterator = 1
for j in range(0,max_value):
        col_name = 'Ingredient_Name' + str(iterator)
#        contain_values = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
#        df2[col_name]= df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
        iterator += 1 

print(df2)

Ideally my results would look like this:

Ingredient_Name  Ingredient_Name1 Igredient_Name2       Ingredient_Name3  Ingredient_Name4
                 ACETAMINOPHEN    ACETYLSALICYLIC ACID  CAFFEINE
                 ACEBUTOLOL       ACETYLSALICYLIC ACID 
                 COLISTIN         HYDROCORTISONE        NEOMYCIN          THONZONIUM BROMIDE
                 BROMIDE

The original Ingredient_Name would contain any values that were not found in the lookup, in this example there are none.

What I've attempted to get the match on the ingredients so far is the following I've included the error messages and the issue with that line of code:

 df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])  ##Doesn't flag True unless it finds a single igredient exists in the string
    ##df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '') error in regards to requiring string not series TypeError: 'in <string>' requires string as left operand, not Series
    #df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]  ## passing 4 items instead of a single pass being implied??
    ##boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())  TypeError: first argument must be string or compiled pattern

The part I'm able to match on the exact string returns the following results, but I'd like to return the value instead of true/fase and match on the substring not the exact match:

                                     Ingredient_Name  WordCount  Num_Of_Ingredients  Exists
0        ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3   False
1                    ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2   False
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BR...          5                   4   False
3                                            BROMIDE          1                   1    True

Perhaps I'm going about this problem the wrong way, or maybe I'm close, but I'm not grasping something. Any help that you can offer to lead me in the right direction, I appreciate it!

I don't fully understand what you really want but maybe this could help you?

pattern = '|'.join(df1['Ingredient_Name'].tolist())
out = df2['Ingredient_Name'].str.findall(pattern).apply(pd.Series)
out.columns = 'Ingredient_Name_' + (out.columns + 1).astype(str)
out = df2.join(out)
print(out)

# Output:
                                       Ingredient_Name  WordCount  Num_Of_Ingredients  \
0          ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3   
1                      ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2   
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE          5                   4   
3                                              BROMIDE          1                   1   

  Ingredient_Name_1     Ingredient_Name_2 Ingredient_Name_3   Ingredient_Name_4  
0     ACETAMINOPHEN  ACETYLSALICYLIC ACID          CAFFEINE                 NaN  
1        ACEBUTOLOL  ACETYLSALICYLIC ACID               NaN                 NaN  
2          COLISTIN        HYDROCORTISONE          NEOMYCIN  THONZONIUM BROMIDE  
3           BROMIDE                   NaN               NaN                 NaN  
  1. Use str.extractall to get all matches
  2. unstack to convert to individual columns
output = df2['Ingredient_Name'].str.extractall(f"({'|'.join(df1['Ingredient_Name'])})").unstack()

#formatting
output = output.droplevel(0,1).rename_axis(None, axis=1).add_prefix("Ingredient_Name_")

>>> output
  Ingredient_Name_0     Ingredient_Name_1 Ingredient_Name_2   Ingredient_Name_3
0     ACETAMINOPHEN  ACETYLSALICYLIC ACID          CAFFEINE                 NaN
1        ACEBUTOLOL  ACETYLSALICYLIC ACID               NaN                 NaN
2          COLISTIN        HYDROCORTISONE          NEOMYCIN  THONZONIUM BROMIDE
3           BROMIDE                   NaN               NaN                 NaN

In order to maintain a column of unmatched ingredients, the best I could come up with was this. If unmatched ingredients aren't as important you're better off using the other inbuilt functions for string and pattern matching mentioned in the other answers. This is probably not the most efficient way to do this.

def match_ingredients(row, df):
  base_str = row['Ingredient_Name']
  result_count = 1
  result = {}
  for idx, ingredient in df.iterrows():
    if ingredient['Ingredient_Name'] in base_str:
        result[f'Ingredient_{result_count}'] = ingredient['Ingredient_Name']
        result_count += 1
        base_str = base_str.replace(ingredient['Ingredient_Name'], "")
  result['Ingredient_Name'] = base_str

  return result

result = df2.apply(match_ingredients,axis=1, result_type='expand', args=(df1,))

df2.apply(match_ingredients) Executes the function over each row of df2 and combines the row type response of the function into another dataframe. It takes df1 as aa parameter so that we can iterate over every ingredient (This can be modified to a list of ingredients as well) and in can be used as a substring check in native Python. If the string is inside the total ingredient list then we use replace to "subtract" it from the total list of ingredients.

The other thing here is that the dictionary returned will have its keys treated as column names so we can assign the remaining base string (after replacing all matching strings) to the constant column name Ingredient Name .

result_type = 'expand' implies that the response of the function is to be turned into multiple columns if possible.

Docs for apply .

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM