Using two dataframes, how can I compare a lookup value as a substring in a column of another dataframe to create a new column if a match exists?
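I am trying to use two dataframes, one of them as a lookup table, to find substrings that match values in a column of my dataset dataframe. Once a value is found, I want to create a new column holding that value, remove the matched substring from the original column, and keep looping over the column until there are no more matches.
The first problem I ran into is that I cannot match or return the matched value unless it is an exact string. The tricky part is that an ingredient name sometimes consists of several words for a single ingredient.
Here is a smaller example of my code. The commented-out parts include the errors or problems I hit with each attempt: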
import pandas as pd

singleingredientdata = {
    'Ingredient_Name':['ACEBUTOLOL','ACETAMINOPHEN','ACETYLSALICYLIC ACID','CAFFEINE','COLISTIN','HYDROCORTISONE','NEOMYCIN','THONZONIUM BROMIDE','BROMIDE'],
    'WordCount':[1,1,2,1,1,1,1,2,1],
    'Num_Of_Ingredients':[1,1,1,1,1,1,1,1,1]
}
multiingredientdata = {
    'Ingredient_Name':['ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE','ACEBUTOLOL ACETYLSALICYLIC ACID','COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE','BROMIDE'],
    'WordCount':[4,3,5,1],
    'Num_Of_Ingredients':[3,2,4,1]
}
df1 = pd.DataFrame(data=singleingredientdata)
df2 = pd.DataFrame(data=multiingredientdata)

ingredientcount = df2["Num_Of_Ingredients"]
max_value = ingredientcount.max()

# Doesn't flag True unless the whole string is a single ingredient that exists in the lookup
df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])

# TypeError: 'in <string>' requires string as left operand, not Series
# df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '')

# Passing 4 items instead of a single pass being implied??
# df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]

# TypeError: first argument must be string or compiled pattern
# boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())

iterator = 1
for j in range(0, max_value):
    col_name = 'Ingredient_Name' + str(iterator)
    # contain_values = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
    # df2[col_name] = df1[df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'])]
    iterator += 1

print(df2)
Ideally, my result would look like this:
Ingredient_Name  Ingredient_Name1  Ingredient_Name2      Ingredient_Name3  Ingredient_Name4
                 ACETAMINOPHEN     ACETYLSALICYLIC ACID  CAFFEINE
                 ACEBUTOLOL        ACETYLSALICYLIC ACID
                 COLISTIN          HYDROCORTISONE        NEOMYCIN          THONZONIUM BROMIDE
                 BROMIDE
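The original Ingredient_Name column would keep any values that were not found in the lookup; in this example there are none.
Here is what I have tried so far to get a match on the ingredients, with the error message or problem noted for each line: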
# Doesn't flag True unless the whole string is a single ingredient that exists in the lookup
df2['Exists'] = df2['Ingredient_Name'].isin(df1['Ingredient_Name'])

# TypeError: 'in <string>' requires string as left operand, not Series
# df2['Exists Value'] = df2['Ingredient_Name'].map(lambda x: df1['Ingredient_Name'] if df2['Ingredient_Name'] in x else '')

# Passing 4 items instead of a single pass being implied??
# df2['Value'] = df2[[x[1] in x[1] for x in zip(df1['Ingredient_Name'], df2['Ingredient_Name'])]]

# TypeError: first argument must be string or compiled pattern
# boolean_findings = df2['Ingredient_Name'].str.contains(df1['Ingredient_Name'].any())
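The part that matches on the exact string returns the result below, but I want to return the matched value rather than True/False, and to match on a substring rather than on the whole string: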
                                     Ingredient_Name  WordCount  Num_Of_Ingredients  Exists
0        ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3   False
1                    ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2   False
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BR...          5                   4   False
3                                            BROMIDE          1                   1    True
Maybe I am going about this the wrong way, or maybe I am close and just missing something. I would appreciate any help that points me in the right direction!
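As a small illustration of the behaviour described in the question: `isin` compares whole cell values, which is why only the BROMIDE row is flagged. The sketch below contrasts it with a substring test built from the lookup names. The trimmed-down data, the `WATER` entry, and the names `lookup`, `names`, `exact` and `partial` are assumptions made for this example only.

```python
import re

import pandas as pd

# Hypothetical, trimmed-down stand-ins for df1 and df2 from the question.
lookup = pd.Series(['ACETYLSALICYLIC ACID', 'CAFFEINE', 'BROMIDE'])
names = pd.Series(['ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE', 'BROMIDE', 'WATER'])

# isin: True only when the whole cell equals one of the lookup values.
exact = names.isin(lookup)

# str.contains: True when any lookup value occurs somewhere inside the cell.
# re.escape guards against regex metacharacters in the names.
pattern = '|'.join(re.escape(value) for value in lookup)
partial = names.str.contains(pattern, regex=True)

print(pd.DataFrame({'name': names, 'exact': exact, 'partial': partial}))
```

Note that `str.contains` expects a single string or compiled pattern, not a Series, which is what the `TypeError` in the last commented-out attempt is complaining about.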
I don't fully understand what you really want, but maybe this can help you?
pattern = '|'.join(df1['Ingredient_Name'].tolist())
out = df2['Ingredient_Name'].str.findall(pattern).apply(pd.Series)
out.columns = 'Ingredient_Name_' + (out.columns + 1).astype(str)
out = df2.join(out)
print(out)
# Output:
                                       Ingredient_Name  WordCount  Num_Of_Ingredients  \
0          ACETAMINOPHEN ACETYLSALICYLIC ACID CAFFEINE          4                   3
1                      ACEBUTOLOL ACETYLSALICYLIC ACID          3                   2
2  COLISTIN HYDROCORTISONE NEOMYCIN THONZONIUM BROMIDE          5                   4
3                                              BROMIDE          1                   1

   Ingredient_Name_1     Ingredient_Name_2  Ingredient_Name_3   Ingredient_Name_4
0      ACETAMINOPHEN  ACETYLSALICYLIC ACID           CAFFEINE                 NaN
1         ACEBUTOLOL  ACETYLSALICYLIC ACID                NaN                 NaN
2           COLISTIN        HYDROCORTISONE           NEOMYCIN  THONZONIUM BROMIDE
3            BROMIDE                   NaN                NaN                 NaN
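One caveat with a pattern built by a plain `'|'.join(...)`: regex alternation tries the alternatives left to right, so if one lookup name is a prefix of another, the shorter one can win. The sketch below shows the effect and one possible guard (escape each name and put longer names first). The `SODIUM` and `SODIUM CHLORIDE` entries are hypothetical and are not part of the question's data; `naive` and `ordered` are names chosen for this example.

```python
import re

import pandas as pd

# Hypothetical lookup in which one name is a prefix of another.
df1 = pd.DataFrame({'Ingredient_Name': ['SODIUM', 'SODIUM CHLORIDE', 'CAFFEINE']})
df2 = pd.DataFrame({'Ingredient_Name': ['SODIUM CHLORIDE CAFFEINE']})

# Plain join: 'SODIUM' is tried first and matches, leaving 'CHLORIDE' behind.
naive_pattern = '|'.join(df1['Ingredient_Name'])
naive = df2['Ingredient_Name'].str.findall(naive_pattern)

# Longest names first, each escaped so it is treated as a literal string.
names = sorted(df1['Ingredient_Name'], key=len, reverse=True)
ordered_pattern = '|'.join(re.escape(name) for name in names)
ordered = df2['Ingredient_Name'].str.findall(ordered_pattern)

print(naive.tolist())
print(ordered.tolist())
```

With the question's own lookup this does not change the result, because none of its names is a prefix of another.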
Use str.extractall to get all matches, then unstack to convert them into individual columns:
output = df2['Ingredient_Name'].str.extractall(f"({'|'.join(df1['Ingredient_Name'])})").unstack()
#formatting
output = output.droplevel(0,1).rename_axis(None, axis=1).add_prefix("Ingredient_Name_")
>>> output
   Ingredient_Name_0     Ingredient_Name_1  Ingredient_Name_2   Ingredient_Name_3
0      ACETAMINOPHEN  ACETYLSALICYLIC ACID           CAFFEINE                 NaN
1         ACEBUTOLOL  ACETYLSALICYLIC ACID                NaN                 NaN
2           COLISTIN        HYDROCORTISONE           NEOMYCIN  THONZONIUM BROMIDE
3            BROMIDE                   NaN                NaN                 NaN
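One thing to keep in mind with this approach: `extractall` only returns rows that have at least one match, so a row with no lookup hit is missing from `output`. A minimal sketch of attaching the result back to `df2` by index, which restores such rows with NaN, follows. The reduced data, the unmatched `WATER` row, and the name `joined` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical reduced data; 'WATER' has no entry in the lookup.
df1 = pd.DataFrame({'Ingredient_Name': ['CAFFEINE', 'NEOMYCIN']})
df2 = pd.DataFrame({'Ingredient_Name': ['CAFFEINE NEOMYCIN', 'WATER', 'NEOMYCIN']})

# One capture group that matches any lookup name.
pattern = '(' + '|'.join(df1['Ingredient_Name']) + ')'

# extractall yields one row per match; unstack spreads matches across columns.
output = df2['Ingredient_Name'].str.extractall(pattern).unstack()
output = output.droplevel(0, axis=1).rename_axis(None, axis=1).add_prefix('Ingredient_Name_')

# Joining on the index brings back the unmatched row, filled with NaN.
joined = df2.join(output)
print(joined)
```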
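To keep a column of the unmatched ingredients, this is the best I could come up with. If the unmatched ingredients are not that important, you are better off using the built-in string and pattern matching functions mentioned in the other answers. This may not be the most efficient approach.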
def match_ingredients(row, df):
    base_str = row['Ingredient_Name']
    result_count = 1
    result = {}
    # Check every lookup ingredient against what is left of the string
    for idx, ingredient in df.iterrows():
        if ingredient['Ingredient_Name'] in base_str:
            result[f'Ingredient_{result_count}'] = ingredient['Ingredient_Name']
            result_count += 1
            # "Subtract" the matched ingredient from the remaining string
            base_str = base_str.replace(ingredient['Ingredient_Name'], "")
    # Whatever is left over was not found in the lookup
    result['Ingredient_Name'] = base_str
    return result

result = df2.apply(match_ingredients, axis=1, result_type='expand', args=(df1,))
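df2.apply(match_ingredients) executes the function on each row of df2 and combines the function's per-row responses into another DataFrame. It takes df1 as an argument so that we can loop over each ingredient (this could also be changed to a list of ingredients), and in can be used in native Python as a substring check. If the string is in the overall ingredient list, we use replace to "subtract" it from the overall ingredient list.
The other thing here is that the keys of the returned dictionary are treated as column names, so we can assign the remaining base string (after all matched strings have been replaced) to the constant column name Ingredient_Name.
result_type = 'expand' means that the function's response will be converted into multiple columns where possible.
See the apply documentation.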
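As a sketch of the "list of ingredients" variation mentioned above, the function below takes a plain list instead of `df1`. It adds two assumptions of its own that are not in the answer: names are tried longest first, so that `THONZONIUM BROMIDE` is removed before `BROMIDE` regardless of the lookup order, and leftover whitespace is collapsed. The name `split_ingredients` and the unmatched `WATER` token are hypothetical.

```python
import pandas as pd


def split_ingredients(text, lookup):
    """Return the lookup names found in text plus whatever text is left over."""
    result = {}
    remaining = text
    # Longest names first, so multi-word names are removed before their parts.
    for name in sorted(lookup, key=len, reverse=True):
        if name in remaining:
            result[f'Ingredient_{len(result) + 1}'] = name
            remaining = remaining.replace(name, '')
    # Collapse the gaps left behind by the removed names.
    result['Ingredient_Name'] = ' '.join(remaining.split())
    return result


lookup = ['BROMIDE', 'THONZONIUM BROMIDE', 'NEOMYCIN']
df2 = pd.DataFrame({'Ingredient_Name': ['NEOMYCIN THONZONIUM BROMIDE WATER', 'BROMIDE']})

# Build one dict per row; the dict keys become the column names.
rows = [split_ingredients(text, lookup) for text in df2['Ingredient_Name']]
result = pd.DataFrame(rows, index=df2.index)
print(result)
```

Because of the sorting, the numbered columns follow name length rather than position in the original string. The plain `in` check also matches inside longer words, so a regex with word boundaries may be a better fit if that matters for the data.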