Using two dataframes, how can I compare a lookup value as a substring of a column in another dataframe and create a new column if a match exists?
I'm attempting to use two dataframes, one of them as a lookup table, to find substring matches against the values in a column of my dataset's dataframe. After I find a value, I'd like to create a new column with that value, remove the matched substring from the initial column, and loop until there are no more matches.
The first problem I'm having is that I'm not able to match, or return the value of the match, unless it's an exact string. The tricky part is that Ingredient_Name sometimes contains multiple words for a single ingredient.
This is a smaller sample of my code. The commented sections include the error or the problem with what I tried:
import pandas as pd
singleingredientdata = {
'Ingredient_Name':['ACEBUTOLOL','ACETAMINOPHEN','ACETYLSALICYLIC ACID','CAFFEINE','COLISTIN','HYDROCORTISONE','NEOMYCIN','THONZONIUM BROMIDE','BROMIDE'],
'WordCount':[1,1,2,1,1,1,1,2,1],
'Num_Of_Ingredients':[1,1,1,1,1,1,1,1,1]
}
multiingredientdata = {
'Ingredient_Name':['ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE','ACEBUTOLOL ACETYLSALICYLIC ACID','COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE','BROMIDE'],
'WordCount':[4,3,5,1],
'Num_Of_Ingredients':[3,2,4,1]
}
df1 = pd.DataFrame(data=singleingredientdata)
df2 = pd.DataFrame(data=multiingredientdata)
ingredientcount = df2["Num_Of_Ingredients"]
max_value = ingredientcount.max()
df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])  # Only flags True when the whole string is a single ingredient
# df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '')  # TypeError: 'in <string>' requires string as left operand, not Series
# df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]  # Passes 4 items instead of the single pass I expected?
# boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())  # TypeError: first argument must be string or compiled pattern
iterator = 1
for j in range(0, max_value):
    col_name = 'Ingredient_Name' + str(iterator)
    # contain_values = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
    # df2[col_name] = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
    iterator += 1
print(df2)
Ideally my results would look like this:
Ingredient_Name  Ingredient_Name1  Ingredient_Name2      Ingredient_Name3  Ingredient_Name4
                 ACETAMINOPHEN     ACETYLSALICYLIC ACID  CAFFEINE
                 ACEBUTOLOL        ACETYLSALICYLIC ACID
                 COLISTIN          HYDROCORTISONE        NEOMYCIN          THONZONIUM BROMIDE
                 BROMIDE
The original Ingredient_Name column would keep any values that were not found in the lookup; in this example there are none.
Here is what I've tried so far to get a match on the ingredients, with the error message or the issue noted next to each line of code:
df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])  # Only flags True when the whole string is a single ingredient
# df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '')  # TypeError: 'in <string>' requires string as left operand, not Series
# df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]  # Passes 4 items instead of the single pass I expected?
# boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())  # TypeError: first argument must be string or compiled pattern
The exact-string match returns the following results, but I'd like to return the matched value instead of True/False, and to match on a substring rather than on the exact string:
Ingredient_Name WordCount Num_Of_Ingredients Exists
0 ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE 4 3 False
1 ACEBUTOLOL ACETYLSALICYLIC ACID 3 2 False
2 COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BR... 5 4 False
3 BROMIDE 1 1 True
Perhaps I'm going about this problem the wrong way, or maybe I'm close but not grasping something. I'd appreciate any help that leads me in the right direction!
I don't fully understand what you really want, but maybe this could help you?
pattern = '|'.join(df1['Ingredient_Name'].tolist())
out = df2['Ingredient_Name'].str.findall(pattern).apply(pd.Series)
out.columns = 'Ingredient_Name_' + (out.columns + 1).astype(str)
out = df2.join(out)
print(out)
# Output:
Ingredient_Name WordCount Num_Of_Ingredients \
0 ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE 4 3
1 ACEBUTOLOL ACETYLSALICYLIC ACID 3 2
2 COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE 5 4
3 BROMIDE 1 1
Ingredient_Name_1 Ingredient_Name_2 Ingredient_Name_3 Ingredient_Name_4
0 ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE NaN
1 ACEBUTOLOL ACETYLSALICYLIC ACID NaN NaN
2 COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE
3 BROMIDE NaN NaN NaN
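One caveat with joining the lookup values straight into a pattern: regex alternation takes the first alternative that matches, so a short name listed before a longer name that starts with it will win, and any regex metacharacters in a name are interpreted rather than matched literally. Below is a minimal sketch of one way to guard against both, assuming hypothetical lookup values (SODIUM, SODIUM CHLORIDE) that are not in the question's data. It sorts the names longest first and escapes them with re.escape.

```python
import re
import pandas as pd

# Hypothetical lookup values: 'SODIUM' is a prefix of 'SODIUM CHLORIDE'.
lookup = ['SODIUM', 'SODIUM CHLORIDE', 'CAFFEINE']
names = pd.Series(['SODIUM CHLORIDE CAFFEINE'])

# Pattern built in lookup order: the shorter name is tried first and wins.
naive = '|'.join(lookup)

# Longest names first, each escaped so it is matched literally.
ordered = sorted(lookup, key=len, reverse=True)
safe = '|'.join(re.escape(value) for value in ordered)

naive_hits = names.str.findall(naive)
safe_hits = names.str.findall(safe)

print(naive_hits.tolist())  # [['SODIUM', 'CAFFEINE']]
print(safe_hits.tolist())   # [['SODIUM CHLORIDE', 'CAFFEINE']]
```

With the sample data in the question this makes no difference, since none of the lookup names is a prefix of another, but it may matter on a larger ingredient list.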
Use str.extractall to get all matches.
Use unstack to convert them to individual columns.
output = df2['Ingredient_Name'].str.extractall(f"({'|'.join(df1['Ingredient_Name'])})").unstack()
#formatting
output = output.droplevel(0,1).rename_axis(None, axis=1).add_prefix("Ingredient_Name_")
>>> output
Ingredient_Name_0 Ingredient_Name_1 Ingredient_Name_2 Ingredient_Name_3
0 ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE NaN
1 ACEBUTOLOL ACETYLSALICYLIC ACID NaN NaN
2 COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE
3 BROMIDE NaN NaN NaN
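If you want the columns numbered from 1 and attached to the original frame, something along these lines might work. It is only a sketch on made-up data: the WATER row is a hypothetical value with no match in the lookup, included because extractall returns no rows for it, so it only reappears (as NaN) after the join.

```python
import re
import pandas as pd

lookup = ['ACEBUTOLOL', 'ACETYLSALICYLIC ACID', 'CAFFEINE']
# 'WATER' is a hypothetical value that is not in the lookup.
df = pd.DataFrame({'Ingredient_Name': ['ACEBUTOLOL ACETYLSALICYLIC ACID', 'CAFFEINE', 'WATER']})

# One capture group around the escaped alternatives, as extractall requires a group.
pattern = '(' + '|'.join(re.escape(value) for value in lookup) + ')'

# Column 0 holds the single capture group; unstack moves the match number into columns.
matches = df['Ingredient_Name'].str.extractall(pattern)[0].unstack()

# Match numbers start at 0, so shift them to get Ingredient_Name1, Ingredient_Name2, ...
matches.columns = [f'Ingredient_Name{number + 1}' for number in matches.columns]

# Joining on the index brings back rows that had no match at all.
joined = df.join(matches)
print(joined)
```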
In order to maintain a column of unmatched ingredients, the best I could come up with was this. If the unmatched ingredients aren't that important, you're better off using the built-in string and pattern matching functions mentioned in the other answers. This is probably not the most efficient way to do it.
def match_ingredients(row, df):
    base_str = row['Ingredient_Name']
    result_count = 1
    result = {}
    for idx, ingredient in df.iterrows():
        if ingredient['Ingredient_Name'] in base_str:
            result[f'Ingredient_{result_count}'] = ingredient['Ingredient_Name']
            result_count += 1
            base_str = base_str.replace(ingredient['Ingredient_Name'], "")
    result['Ingredient_Name'] = base_str
    return result

result = df2.apply(match_ingredients, axis=1, result_type='expand', args=(df1,))
df2.apply(match_ingredients, ...) executes the function over each row of df2 and combines the row-type responses of the function into another dataframe. It takes df1 as a parameter so that we can iterate over every ingredient (this can be modified to take a list of ingredients as well), and Python's in operator can be used as a native substring check. If the ingredient is inside the total ingredient string, then we use replace to "subtract" it from the total string of ingredients.
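As a rough sketch of the list-based variant mentioned above, the helper below takes a plain list of ingredient names instead of df1. The name split_ingredients and the WATER value are made up for illustration. Note that the matches come out in lookup order rather than in the order they appear in the text, and that a longer name must be listed before any shorter name it contains (THONZONIUM BROMIDE before BROMIDE here), otherwise the shorter one gets removed first.

```python
import pandas as pd

def split_ingredients(text, ingredients):
    """Hypothetical helper: collect known ingredients and return what is left over."""
    found = []
    remaining = text
    for name in ingredients:
        if name in remaining:                        # plain substring check
            found.append(name)
            remaining = remaining.replace(name, '')  # "subtract" the match
    row = {f'Ingredient_{i}': name for i, name in enumerate(found, start=1)}
    # Collapse the spaces left behind by the removed names.
    row['Ingredient_Name'] = ' '.join(remaining.split())
    return row

ingredients = ['THONZONIUM BROMIDE', 'BROMIDE', 'NEOMYCIN']
# 'WATER' is a hypothetical value that is not in the lookup.
df = pd.DataFrame({'Ingredient_Name': ['NEOMYCIN THONZONIUM BROMIDE', 'BROMIDE WATER']})

expanded = pd.DataFrame(
    [split_ingredients(text, ingredients) for text in df['Ingredient_Name']],
    index=df.index,
)
print(expanded)
```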
The other thing here is that the keys of the returned dictionary are treated as column names, so we can assign the remaining base string (after replacing all matching strings) to the constant column name Ingredient_Name.
result_type='expand' implies that the response of the function is to be turned into multiple columns if possible.
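To see what result_type='expand' changes, here is a small sketch on a toy function (first_word_and_length is a made-up name, not part of the answer above). Without the argument, a function that returns a dict gives you a Series of dicts. With it, the dict keys become columns.

```python
import pandas as pd

df = pd.DataFrame({'Ingredient_Name': ['ACEBUTOLOL CAFFEINE', 'BROMIDE']})

def first_word_and_length(row):
    """Toy row function that returns a dict, like match_ingredients does."""
    text = row['Ingredient_Name']
    return {'First_Word': text.split()[0], 'Length': len(text)}

# Default behaviour: one dict per row, stored in a Series.
as_series = df.apply(first_word_and_length, axis=1)

# With result_type='expand': the dict keys become column names.
as_frame = df.apply(first_word_and_length, axis=1, result_type='expand')

print(type(as_series.iloc[0]))
print(as_frame)
```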