Using two dataframes, how can I compare a lookup value as a substring in a column of another dataframe and create a new column if a match exists?

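I am trying to use two dataframes, one of them as a lookup table, to find substrings that match values in a column of my dataset dataframe. Once a value is found, I want to create a new column holding that value, remove the matched substring from the original column, and repeat over the whole column until there are no more matches.

The first problem I'm running into is that I can't match or return the matched value unless it is an exact string. The tricky part is that a single ingredient's name sometimes consists of several words.

Here is a smaller sample of my code; the commented-out lines include the errors or issues from what I have tried: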

import pandas as pd

singleingredientdata = {
    'Ingredient_Name':['ACEBUTOLOL','ACETAMINOPHEN','ACETYLSALICYLIC ACID','CAFFEINE','COLISTIN','HYDROCORTISONE','NEOMYCIN','THONZONIUM BROMIDE','BROMIDE'],
    'WordCount':[1,1,2,1,1,1,1,2,1],
    'Num_Of_Ingredients':[1,1,1,1,1,1,1,1,1]
}

multiingredientdata = {
    'Ingredient_Name':['ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE','ACEBUTOLOL ACETYLSALICYLIC ACID','COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE','BROMIDE'],
    'WordCount':[4,3,5,1],
    'Num_Of_Ingredients':[3,2,4,1]
}

df1 = pd.DataFrame(data=singleingredientdata)
df2 = pd.DataFrame(data=multiingredientdata)
ingredientcount = df2["Num_Of_Ingredients"]
max_value = ingredientcount.max()



df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])  # only True when the whole string is exactly one lookup ingredient
# TypeError: 'in <string>' requires string as left operand, not Series
# df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '')
# Passes 4 items instead of the single pass I expected?
# df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]
# TypeError: first argument must be string or compiled pattern
# boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())
iterator = 1
for j in range(0, max_value):
    col_name = 'Ingredient_Name' + str(iterator)
    # contain_values = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
    # df2[col_name] = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
    iterator += 1

print(df2)

Ideally, my result would look like this:

Ingredient_Name  Ingredient_Name1 Ingredient_Name2      Ingredient_Name3  Ingredient_Name4
                 ACETAMINOPHEN    ACETYLSALICYLIC ACID  CAFFEINE
                 ACEBUTOLOL       ACETYLSALICYLIC ACID 
                 COLISTIN         HYDROCORTISONE        NEOMYCIN          THONZONIUM BROMIDE
                 BROMIDE

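The original Ingredient_Name column would keep any values not found in the lookup; in this example there are none.

Below is what I have tried so far to get a match on the ingredients. I have included the error message or issue for each line of code: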

df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])  # only True when the whole string is exactly one lookup ingredient
# TypeError: 'in <string>' requires string as left operand, not Series
# df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '')
# Passes 4 items instead of the single pass I expected?
# df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]
# TypeError: first argument must be string or compiled pattern
# boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())

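The exact-string match is the only part I got working. It returns the following, but I want to return the matched values rather than True/False, and to match on substrings rather than on the whole string: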

                                     Ingredient_Name  WordCount  Num_Of_Ingredients  Exists
0        ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3   False
1                    ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2   False
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BR...          5                   4   False
3                                            BROMIDE          1                   1    True

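Maybe I'm approaching this the wrong way, or maybe I'm close and just missing something. Any help pointing me in the right direction would be appreciated!

I don't fully understand what you actually want, but maybe this helps?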

pattern = '|'.join(df1['Ingredient_Name'].tolist())
out = df2['Ingredient_Name'].str.findall(pattern).apply(pd.Series)
out.columns = 'Ingredient_Name_' + (out.columns + 1).astype(str)
out = df2.join(out)
print(out)

# Output:
                                       Ingredient_Name  WordCount  Num_Of_Ingredients  \
0          ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3   
1                      ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2   
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE          5                   4   
3                                              BROMIDE          1                   1   

  Ingredient_Name_1     Ingredient_Name_2 Ingredient_Name_3   Ingredient_Name_4  
0     ACETAMINOPHEN  ACETYLSALICYLIC ACID          CAFFEINE                 NaN  
1        ACEBUTOLOL  ACETYLSALICYLIC ACID               NaN                 NaN  
2          COLISTIN        HYDROCORTISONE          NEOMYCIN  THONZONIUM BROMIDE  
3           BROMIDE                   NaN               NaN                 NaN  
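
A joined pattern like the one above is tried left to right at each position, so a short name listed before a longer name that starts with it will win. The sketch below is not part of the answer; it assumes a hypothetical lookup in which 'ACETYLSALICYLIC' appears on its own as well as inside 'ACETYLSALICYLIC ACID', and shows one way to guard against that by sorting the names longest first and escaping them with re.escape.

```python
import re
import pandas as pd

# Hypothetical lookup: one name is a prefix of another (not the question's data).
lookup = pd.Series(['ACETYLSALICYLIC', 'ACETYLSALICYLIC ACID', 'CAFFEINE'])
text = pd.Series(['ACETYLSALICYLIC ACID CAFFEINE'])

# Naive join: alternatives are tried in list order, so the shorter name matches first.
naive_pattern = '|'.join(lookup)
naive = text.str.findall(naive_pattern)

# Longest names first, each escaped so regex metacharacters are taken literally.
ordered = sorted(lookup, key=len, reverse=True)
safe_pattern = '|'.join(re.escape(name) for name in ordered)
safe = text.str.findall(safe_pattern)

print(naive[0])  # the two-word ingredient is cut short
print(safe[0])   # the two-word ingredient is kept whole
```

With the question's own data the naive join happens to work, because no lookup name is a prefix of another one.
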
  1. Use str.extractall to get all matches
  2. unstack to pivot the matches into separate columns
output = df2['Ingredient_Name'].str.extractall(f"({'|'.join(df1['Ingredient_Name'])})").unstack()

# formatting: drop the capture-group level, clear the axis name, prefix the match numbers
output = output.droplevel(0, axis=1).rename_axis(None, axis=1).add_prefix("Ingredient_Name_")

>>> output
  Ingredient_Name_0     Ingredient_Name_1 Ingredient_Name_2   Ingredient_Name_3
0     ACETAMINOPHEN  ACETYLSALICYLIC ACID          CAFFEINE                 NaN
1        ACEBUTOLOL  ACETYLSALICYLIC ACID               NaN                 NaN
2          COLISTIN        HYDROCORTISONE          NEOMYCIN  THONZONIUM BROMIDE
3           BROMIDE                   NaN               NaN                 NaN
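
To make the two steps easier to follow, here is a minimal sketch on a reduced dataset that keeps the intermediate long-format result and then joins the wide result back onto the original frame. The variable names (long_form, wide, result) and the 1-based column numbering are choices made for this illustration, not part of the answer.

```python
import pandas as pd

# Reduced sample data for illustration.
lookup = pd.Series(['ACEBUTOLOL', 'ACETYLSALICYLIC ACID', 'CAFFEINE'])
df = pd.DataFrame({'Ingredient_Name': ['ACEBUTOLOL ACETYLSALICYLIC ACID', 'CAFFEINE']})

# extractall needs a capture group, hence the surrounding parentheses.
pattern = f"({'|'.join(lookup)})"

# Long format: one row per match, indexed by (original row, match number).
long_form = df['Ingredient_Name'].str.extractall(pattern)

# Take the single capture-group column (0) and pivot the 'match' level into columns.
wide = long_form[0].unstack('match')
wide.columns = [f'Ingredient_Name_{i + 1}' for i in wide.columns]

# Left join on the index, so rows without any match would simply get NaN.
result = df.join(wide)
print(result)
```
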

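To keep a column of the unmatched ingredients, this is the best I could come up with. If the unmatched ingredients aren't that important, you're better off using the built-in string and pattern-matching functions mentioned in the other answers. This is probably not the most efficient approach.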

def match_ingredients(row, df):
    base_str = row['Ingredient_Name']
    result_count = 1
    result = {}
    for idx, ingredient in df.iterrows():
        # plain substring test against what is left of the string
        if ingredient['Ingredient_Name'] in base_str:
            result[f'Ingredient_{result_count}'] = ingredient['Ingredient_Name']
            result_count += 1
            # "subtract" the match so it cannot be found again
            base_str = base_str.replace(ingredient['Ingredient_Name'], "")
    # whatever was not matched stays in the original column
    result['Ingredient_Name'] = base_str

    return result

result = df2.apply(match_ingredients, axis=1, result_type='expand', args=(df1,))

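df2.apply(match_ingredients, axis=1, ...) runs the function on each row of df2 and combines the function's per-row results into another DataFrame. It takes df1 as an argument so that we can iterate over each ingredient (this could also be changed to a plain list of ingredients), and Python's native in operator serves as the substring check. If an ingredient is found in the full ingredient string, we use replace to "subtract" it from that string.

The other thing to note is that the keys of the returned dictionary are treated as column names, so we can assign the remaining base string (after all matched strings have been replaced) to the fixed column name Ingredient_Name.

result_type='expand' means the function's return value is expanded into multiple columns where possible.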

See the documentation for apply.
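
For comparison, the leftover column can also be built without a row-wise function. The following is only a sketch of a regex-based alternative, not the approach used in this answer: it assumes the lookup names can be joined into one pattern, and it uses an invented ingredient, 'UNKNOWNIUM', to stand for a value missing from the lookup.

```python
import re
import pandas as pd

# Hypothetical data: 'UNKNOWNIUM' is an invented name absent from the lookup.
lookup = pd.Series(['CAFFEINE', 'THONZONIUM BROMIDE', 'BROMIDE'])
df = pd.DataFrame(
    {'Ingredient_Name': ['CAFFEINE UNKNOWNIUM THONZONIUM BROMIDE', 'BROMIDE']}
)

# Longest names first and escaped, so multi-word names are matched whole.
ordered = sorted(lookup, key=len, reverse=True)
pattern = '|'.join(re.escape(name) for name in ordered)

# One list of matches per row, spread into numbered columns.
matches = df['Ingredient_Name'].str.findall(pattern)
wide = pd.DataFrame(matches.tolist(), index=df.index)
wide.columns = [f'Ingredient_{i + 1}' for i in wide.columns]

# Remove every match, then tidy the whitespace that is left behind.
remainder = (
    df['Ingredient_Name']
    .str.replace(pattern, '', regex=True)
    .str.replace(r'\s+', ' ', regex=True)
    .str.strip()
)

result = wide.assign(Ingredient_Name=remainder)
print(result)
```

Rows that are fully matched end up with an empty string in Ingredient_Name, which is the "no unmatched values" case from the question.
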
