Using two dataframes, how to compare a lookup value as a substring in another dataframe's column and create a new column if a match exists

Question

I'm attempting to use two dataframes, one as a lookup table, to find a substring match on the values in a column of my dataset's dataframe. After I find a value, I'd like to create a new column with it, remove the matched substring from the initial column, and repeat across the entire column until there are no more matches.

The first problem I'm having is that I can't match, or return the value of the match, unless it's an exact string. The tricky part is that Ingredient_Name sometimes contains multiple words for a single ingredient.

This is a smaller sample of my code; the commented sections include the error or the problem with what I tried:

import pandas as pd

singleingredientdata = {
    'Ingredient_Name': ['ACEBUTOLOL','ACETAMINOPHEN','ACETYLSALICYLIC ACID','CAFFEINE','COLISTIN','HYDROCORTISONE','NEOMYCIN','THONZONIUM BROMIDE','BROMIDE'],
    'WordCount': [1,1,2,1,1,1,1,2,1],
    'Num_Of_Ingredients': [1,1,1,1,1,1,1,1,1]
}

multiingredientdata = {
    'Ingredient_Name': ['ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE','ACEBUTOLOL ACETYLSALICYLIC ACID','COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE','BROMIDE'],
    'WordCount': [4,3,5,1],
    'Num_Of_Ingredients': [3,2,4,1]
}

df1 = pd.DataFrame(data=singleingredientdata)
df2 = pd.DataFrame(data=multiingredientdata)
ingredientcount = df2["Num_Of_Ingredients"]
max_value = ingredientcount.max()



df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])  ## Only flags True when the entire string is exactly one ingredient
##df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '') error in regards to requiring string not series TypeError: 'in <string>' requires string as left operand, not Series
#df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]  ## passing 4 items instead of a single pass being implied??
##boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())  TypeError: first argument must be string or compiled pattern
iterator = 1
for j in range(0,max_value):
        col_name = 'Ingredient_Name' + str(iterator)
#        contain_values = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
#        df2[col_name]= df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
        iterator += 1 

print(df2)

Ideally my results would look like this:

Ingredient_Name  Ingredient_Name1 Ingredient_Name2      Ingredient_Name3  Ingredient_Name4
                 ACETAMINOPHEN    ACETYLSALICYLIC ACID  CAFFEINE
                 ACEBUTOLOL       ACETYLSALICYLIC ACID 
                 COLISTIN         HYDROCORTISONE        NEOMYCIN          THONZONIUM BROMIDE
                 BROMIDE

The original Ingredient_Name would contain any values that were not found in the lookup; in this example there are none.

What I've attempted so far to get the match on the ingredients is the following; I've included the error messages and the issue with each line of code:

df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])  ## Only flags True when the entire string is exactly one ingredient
##df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '') error in regards to requiring string not series TypeError: 'in <string>' requires string as left operand, not Series
#df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]  ## passing 4 items instead of a single pass being implied??
##boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())  TypeError: first argument must be string or compiled pattern

The part where I'm able to match on the exact string returns the following results, but I'd like to return the value instead of True/False and match on the substring rather than the exact string:

                                     Ingredient_Name  WordCount  Num_Of_Ingredients  Exists
0        ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3   False
1                    ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2   False
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BR...          5                   4   False
3                                            BROMIDE          1                   1    True

Perhaps I'm going about this problem the wrong way, or maybe I'm close but not grasping something. I'd appreciate any help you can offer to lead me in the right direction!

Answer 1

I don't fully understand what you really want, but maybe this could help you?

pattern = '|'.join(df1['Ingredient_Name'].tolist())
out = df2['Ingredient_Name'].str.findall(pattern).apply(pd.Series)
out.columns = 'Ingredient_Name_' + (out.columns + 1).astype(str)
out = df2.join(out)
print(out)

# Output:
                                       Ingredient_Name  WordCount  Num_Of_Ingredients  \
0          ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3   
1                      ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2   
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE          5                   4   
3                                              BROMIDE          1                   1   

  Ingredient_Name_1     Ingredient_Name_2 Ingredient_Name_3   Ingredient_Name_4  
0     ACETAMINOPHEN  ACETYLSALICYLIC ACID          CAFFEINE                 NaN  
1        ACEBUTOLOL  ACETYLSALICYLIC ACID               NaN                 NaN  
2          COLISTIN        HYDROCORTISONE          NEOMYCIN  THONZONIUM BROMIDE  
3           BROMIDE                   NaN               NaN                 NaN
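
One caveat with joining the lookup names into a regex: alternation tries the alternatives from left to right, so a short name that is a prefix of a longer one can win if it is listed first. The sample data happens to avoid this. The sketch below is only an illustration under an assumed lookup list (the extra "THONZONIUM" entry is hypothetical and not in the question's data); it escapes each name with re.escape and sorts longer names first.

```python
import re
import pandas as pd

# Hypothetical lookup: "THONZONIUM" is added here only to create a prefix clash.
lookup = ["THONZONIUM", "THONZONIUM BROMIDE", "BROMIDE"]
names = pd.Series(["COLISTIN THONZONIUM BROMIDE"])

# Plain join, as in the answer: the first alternative that fits wins.
naive_pattern = "|".join(lookup)
naive = names.str.findall(naive_pattern).tolist()

# Escape each name and put longer names first so they are tried first.
ordered = sorted(lookup, key=len, reverse=True)
safe_pattern = "|".join(re.escape(name) for name in ordered)
safe = names.str.findall(safe_pattern).tolist()

print(naive)  # [['THONZONIUM', 'BROMIDE']]
print(safe)   # [['THONZONIUM BROMIDE']]
```

Escaping also matters if any ingredient name contains regex metacharacters such as parentheses or plus signs.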

Answer 2

Use str.extractall to get all matches.
Use unstack to convert them to individual columns.

output = df2['Ingredient_Name'].str.extractall(f"({'|'.join(df1['Ingredient_Name'])})").unstack()

#formatting
output = output.droplevel(0,1).rename_axis(None, axis=1).add_prefix("Ingredient_Name_")

>>> output
  Ingredient_Name_0     Ingredient_Name_1 Ingredient_Name_2   Ingredient_Name_3
0     ACETAMINOPHEN  ACETYLSALICYLIC ACID          CAFFEINE                 NaN
1        ACEBUTOLOL  ACETYLSALICYLIC ACID               NaN                 NaN
2          COLISTIN        HYDROCORTISONE          NEOMYCIN  THONZONIUM BROMIDE
3           BROMIDE                   NaN               NaN                 NaN
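
If the goal is the layout from the question (columns numbered from 1 and attached to the original dataframe), the same extractall result can be renamed and joined back. This is a rough sketch on a trimmed-down copy of the sample data; the variable names wide and out are just illustrative.

```python
import pandas as pd

# Trimmed-down version of the question's sample data.
df1 = pd.DataFrame({"Ingredient_Name": ["ACEBUTOLOL", "ACETYLSALICYLIC ACID", "BROMIDE"]})
df2 = pd.DataFrame({"Ingredient_Name": ["ACEBUTOLOL ACETYLSALICYLIC ACID", "BROMIDE"]})

pattern = f"({'|'.join(df1['Ingredient_Name'])})"

# extractall returns one row per match; column 0 holds the captured group.
# unstack pivots the match number into columns 0, 1, ...
wide = df2["Ingredient_Name"].str.extractall(pattern)[0].unstack()

# Rename to Ingredient_Name1, Ingredient_Name2, ... (1-based, as in the question).
wide.columns = [f"Ingredient_Name{i + 1}" for i in wide.columns]

# Join on the index; rows without any match would simply get NaN.
out = df2.join(wide)
print(out)
```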

Answer 3

In order to maintain a column of unmatched ingredients, the best I could come up with was this. If unmatched ingredients aren't as important, you're better off using the built-in string and pattern matching functions mentioned in the other answers. This is probably not the most efficient way to do this.

def match_ingredients(row, df):
    base_str = row['Ingredient_Name']
    result_count = 1
    result = {}
    for idx, ingredient in df.iterrows():
        if ingredient['Ingredient_Name'] in base_str:
            result[f'Ingredient_{result_count}'] = ingredient['Ingredient_Name']
            result_count += 1
            base_str = base_str.replace(ingredient['Ingredient_Name'], "")
    result['Ingredient_Name'] = base_str

    return result

result = df2.apply(match_ingredients,axis=1, result_type='expand', args=(df1,))

df2.apply(match_ingredients, axis=1, ...) executes the function over each row of df2 and combines the dictionaries it returns into another dataframe. It takes df1 as a parameter so that we can iterate over every ingredient (this could be changed to a plain list of ingredients as well), and Python's native in operator serves as the substring check. If an ingredient is found inside the full ingredient string, we use replace to "subtract" it from that string.

The other thing here is that the keys of the returned dictionary are treated as column names, so we can assign the remaining base string (after replacing all matching strings) to the constant column name Ingredient_Name.

result_type='expand' means that the function's return value is turned into multiple columns if possible.

See the pandas docs for DataFrame.apply.
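
For what it's worth, here is a rough sketch of the "list of ingredients" variant mentioned above. It assumes the lookup is sorted longest name first, so that a multi-word name is removed before any shorter name it contains, and it collapses the leftover whitespace in the unmatched remainder. The function name split_ingredients and the "UNKNOWNIUM" value are made up for illustration.

```python
import pandas as pd

def split_ingredients(text, ingredients):
    """Return the matched ingredients plus whatever text is left over."""
    result = {}
    remaining = text
    for name in ingredients:
        if name in remaining:
            result[f"Ingredient_{len(result) + 1}"] = name
            remaining = remaining.replace(name, "")
    # Collapse the gaps left behind by the removed names.
    result["Ingredient_Name"] = " ".join(remaining.split())
    return result

df1 = pd.DataFrame({"Ingredient_Name": ["BROMIDE", "THONZONIUM BROMIDE", "NEOMYCIN"]})
# "UNKNOWNIUM" is a hypothetical value that is not in the lookup.
df2 = pd.DataFrame({"Ingredient_Name": ["NEOMYCIN THONZONIUM BROMIDE UNKNOWNIUM", "BROMIDE"]})

# Longest names first, so "THONZONIUM BROMIDE" is removed before "BROMIDE".
ingredients = sorted(df1["Ingredient_Name"], key=len, reverse=True)

# Building the frame from a list of dicts fills missing keys with NaN.
result = pd.DataFrame(
    [split_ingredients(text, ingredients) for text in df2["Ingredient_Name"]],
    index=df2.index,
)
print(result)
```

Note that the numbered columns follow the order of the lookup list, not the order in which the ingredients appear in the text.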
