Using two dataframes, how can I compare a lookup value as a substring in a column of another dataframe and create a new column if a match exists?

I'm attempting to use two dataframes, one of them as a lookup table, to find substring matches in a column of my dataset's dataframe. After I find a value, I'd like to create a new column holding that value, remove the matched substring from the original column, and repeat until there are no more matches.

The first problem I'm having is that I can't match, or return the value of the match, unless it's an exact string. The tricky part is that Ingredient_Name sometimes contains multiple words for a single ingredient.

This is a smaller sample of my code; the commented-out lines include the error or the problem with what I tried:

import pandas as pd

singleingredientdata = {
    'Ingredient_Name': ['ACEBUTOLOL','ACETAMINOPHEN','ACETYLSALICYLIC ACID','CAFFEINE','COLISTIN','HYDROCORTISONE','NEOMYCIN','THONZONIUM BROMIDE','BROMIDE'],
    'WordCount': [1,1,2,1,1,1,1,2,1],
    'Num_Of_Ingredients': [1,1,1,1,1,1,1,1,1]
}

multiingredientdata = {
    'Ingredient_Name': ['ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE','ACEBUTOLOL ACETYLSALICYLIC ACID','COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE','BROMIDE'],
    'WordCount': [4,3,5,1],
    'Num_Of_Ingredients': [3,2,4,1]
}

df1 = pd.DataFrame(data=singleingredientdata)
df2 = pd.DataFrame(data=multiingredientdata)
ingredientcount = df2["Num_Of_Ingredients"]
max_value = ingredientcount.max()



# Only flags True when the whole string is exactly one ingredient from the lookup
df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])
# TypeError: 'in <string>' requires string as left operand, not Series
# df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '')
# Passes 4 items instead of the single pass I expected?
# df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]
# TypeError: first argument must be string or compiled pattern
# boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())
iterator = 1
for j in range(0, max_value):
    col_name = 'Ingredient_Name' + str(iterator)
    # contain_values = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
    # df2[col_name] = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
    iterator += 1

print(df2)

Ideally my results would look like this:

Ingredient_Name  Ingredient_Name1 Ingredient_Name2      Ingredient_Name3  Ingredient_Name4
                 ACETAMINOPHEN    ACETYLSALICYLIC ACID  CAFFEINE
                 ACEBUTOLOL       ACETYLSALICYLIC ACID 
                 COLISTIN         HYDROCORTISONE        NEOMYCIN          THONZONIUM BROMIDE
                 BROMIDE

The original Ingredient_Name column would contain any values that were not found in the lookup; in this example there are none.

What I've attempted so far to match the ingredients is the following; I've included the error message or the issue with each line:

# Only flags True when the whole string is exactly one ingredient from the lookup
df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])
# TypeError: 'in <string>' requires string as left operand, not Series
# df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '')
# Passes 4 items instead of the single pass I expected?
# df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]
# TypeError: first argument must be string or compiled pattern
# boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())

Matching on the exact string returns the following results, but I'd like to return the matched value instead of True/False, and to match on the substring rather than the exact string:

                                     Ingredient_Name  WordCount  Num_Of_Ingredients  Exists
0        ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3   False
1                    ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2   False
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BR...          5                   4   False
3                                            BROMIDE          1                   1    True

Perhaps I'm going about this problem the wrong way, or maybe I'm close but not grasping something. I'd appreciate any help that points me in the right direction!

I don't fully understand what you really want, but maybe this could help you?

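# Build one regex alternation from all lookup names
pattern = '|'.join(df1['Ingredient_Name'].tolist())
# findall returns a list of matches per row; pd.Series spreads each list across columns
out = df2['Ingredient_Name'].str.findall(pattern).apply(pd.Series)
# Rename the positional columns 0..n-1 to Ingredient_Name_1..n
out.columns = 'Ingredient_Name_' + (out.columns + 1).astype(str)
out = df2.join(out)
print(out)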

# Output:
                                       Ingredient_Name  WordCount  Num_Of_Ingredients  \
0          ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3   
1                      ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2   
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE          5                   4   
3                                              BROMIDE          1                   1   

  Ingredient_Name_1     Ingredient_Name_2 Ingredient_Name_3   Ingredient_Name_4  
0     ACETAMINOPHEN  ACETYLSALICYLIC ACID          CAFFEINE                 NaN  
1        ACEBUTOLOL  ACETYLSALICYLIC ACID               NaN                 NaN  
2          COLISTIN        HYDROCORTISONE          NEOMYCIN  THONZONIUM BROMIDE  
3           BROMIDE                   NaN               NaN                 NaN  
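
One caveat with a plain '|'.join pattern is that regex alternation tries the alternatives left to right, so a short name listed before a longer name that starts with it would win. The following is a minimal sketch, not part of the answer above, that sorts the names longest first, escapes them with re.escape, and adds word boundaries. The names SODIUM and SODIUM CHLORIDE are hypothetical examples chosen to show the prefix case; they are not in the question's data.

```python
import re
import pandas as pd

# Hypothetical lookup where one name is a prefix of another.
df1 = pd.DataFrame({"Ingredient_Name": ["SODIUM", "SODIUM CHLORIDE", "CAFFEINE"]})
df2 = pd.DataFrame({"Ingredient_Name": ["SODIUM CHLORIDE CAFFEINE", "SODIUM"]})

# Longest names first, so multi-word ingredients are tried before their prefixes.
names = sorted(df1["Ingredient_Name"], key=len, reverse=True)

# re.escape protects against regex metacharacters in names;
# \b keeps matches aligned to word boundaries.
pattern = r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"

# One list of matched ingredient names per row.
matches = df2["Ingredient_Name"].str.findall(pattern)
print(matches.tolist())
```

For the question's sample data the ordering makes no difference, because no ingredient name there is a prefix of another.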
  1. Use str.extractall to get all matches.
  2. Use unstack to convert them to individual columns.
output = df2['Ingredient_Name'].str.extractall(f"({'|'.join(df1['Ingredient_Name'])})").unstack()

#formatting
output = output.droplevel(0,1).rename_axis(None, axis=1).add_prefix("Ingredient_Name_")

>>> output
  Ingredient_Name_0     Ingredient_Name_1 Ingredient_Name_2   Ingredient_Name_3
0     ACETAMINOPHEN  ACETYLSALICYLIC ACID          CAFFEINE                 NaN
1        ACEBUTOLOL  ACETYLSALICYLIC ACID               NaN                 NaN
2          COLISTIN        HYDROCORTISONE          NEOMYCIN  THONZONIUM BROMIDE
3           BROMIDE                   NaN               NaN                 NaN
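
The question also asks for the original column to keep only the text that was not matched. As a rough sketch under the assumption that the same kind of alternation pattern is used, the leftover text could be computed with str.replace. UNKNOWNIUM is a made-up token standing in for an ingredient that is missing from the lookup.

```python
import re
import pandas as pd

df1 = pd.DataFrame({"Ingredient_Name": ["ACEBUTOLOL", "ACETYLSALICYLIC ACID", "CAFFEINE"]})
# "UNKNOWNIUM" is a hypothetical value that has no entry in the lookup.
df2 = pd.DataFrame({"Ingredient_Name": ["ACEBUTOLOL ACETYLSALICYLIC ACID", "CAFFEINE UNKNOWNIUM"]})

# Escaped alternation, longest names first.
names = sorted(df1["Ingredient_Name"], key=len, reverse=True)
pattern = "|".join(re.escape(n) for n in names)

leftover = (
    df2["Ingredient_Name"]
    .str.replace(pattern, "", regex=True)   # drop every matched ingredient
    .str.replace(r"\s+", " ", regex=True)   # collapse the gaps left behind
    .str.strip()
)
print(leftover.tolist())
```

An empty string in leftover means every word in that row was accounted for by the lookup.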

To also keep a column of unmatched ingredients, the best I could come up with was this. If unmatched ingredients aren't as important, you're better off using the built-in string and pattern-matching functions mentioned in the other answers. This is probably not the most efficient way to do it.

def match_ingredients(row, df):
    base_str = row['Ingredient_Name']
    result_count = 1
    result = {}
    for idx, ingredient in df.iterrows():
        if ingredient['Ingredient_Name'] in base_str:
            result[f'Ingredient_{result_count}'] = ingredient['Ingredient_Name']
            result_count += 1
            base_str = base_str.replace(ingredient['Ingredient_Name'], "")
    result['Ingredient_Name'] = base_str

    return result

result = df2.apply(match_ingredients, axis=1, result_type='expand', args=(df1,))

df2.apply(match_ingredients, ...) executes the function over each row of df2 and combines the row-like results of the function into another dataframe. It takes df1 as a parameter so that we can iterate over every ingredient (this can be modified to take a list of ingredients as well), and the in operator can be used as a substring check in native Python. If the ingredient string is inside the full ingredient string, we use replace to "subtract" it from the full string.

The other thing to note is that the keys of the returned dictionary are treated as column names, so we can assign the remaining base string (after replacing all matched strings) to the constant column name Ingredient_Name.

result_type='expand' means that the function's return value is turned into multiple columns if possible.
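
To see the effect of result_type='expand' in isolation, here is a small toy sketch; the text column and the split_words helper are made up for illustration and are not part of the answer. Each row returns a dictionary, and the keys become columns, with missing keys filled with NaN.

```python
import pandas as pd

df = pd.DataFrame({"text": ["a b", "c"]})

def split_words(row):
    # Hypothetical helper: one dictionary key per word in the row.
    words = row["text"].split()
    return {f"word_{i}": w for i, w in enumerate(words, start=1)}

# Dictionary keys become column names; rows lacking a key get NaN there.
expanded = df.apply(split_words, axis=1, result_type="expand")
print(expanded)
```

This is the same mechanism that lets match_ingredients return a different number of Ingredient_N keys for each row.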

Docs for apply.
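
As mentioned, the function can be changed to take a plain list of ingredients instead of df1. The following is one possible sketch of that variant, with a hypothetical function name match_from_list. It also sorts the list longest first, an addition of mine, so that THONZONIUM BROMIDE is removed before BROMIDE regardless of the order in the lookup.

```python
import pandas as pd

def match_from_list(text, ingredients):
    """Return matched ingredients plus whatever text is left over."""
    result = {}
    remaining = text
    # Longest first, so a multi-word name is removed before a shorter name it contains.
    for name in sorted(ingredients, key=len, reverse=True):
        if name in remaining:
            result[f"Ingredient_{len(result) + 1}"] = name
            remaining = remaining.replace(name, "")
    # Collapse leftover whitespace; an empty string means everything matched.
    result["Ingredient_Name"] = " ".join(remaining.split())
    return result

ingredients = ["BROMIDE", "THONZONIUM BROMIDE", "NEOMYCIN"]
df2 = pd.DataFrame({"Ingredient_Name": ["NEOMYCIN THONZONIUM BROMIDE", "BROMIDE"]})

# A list of dictionaries becomes one row per dictionary, one column per key.
result = pd.DataFrame([match_from_list(t, ingredients) for t in df2["Ingredient_Name"]])
print(result)
```

Note that with this ordering the Ingredient_N columns follow name length rather than position in the original string, and the plain in check still matches inside longer words, so it may need word-boundary handling for real data.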
