Compare two dataframes for missing rows based on multiple columns python
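I want to compare two dataframes that share some (but not all) columns and print a new dataframe showing the rows of df1 that are missing from df2, plus a second dataframe showing the rows of df2 that are missing from df1. The comparison should be based on a given set of columns.
Here the "key columns" are named key_column1 and key_column2.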
import pandas as pd
data1 = {'first_column': ['4', '2', '7', '2', '2'],
'second_column': ['1', '2', '2', '2', '2'],
'key_column1':['1', '3', '2', '6', '4'],
'key_column2':['1', '2', '2', '1', '1'],
'fourth_column':['1', '2', '2', '2', '2'],
'other':['1', '2', '3', '2', '2'],
}
df1 = pd.DataFrame(data1)
data2 = {'first': ['1', '2', '2', '2', '2'],
'second_column': ['1', '2', '2', '2', '2'],
'key_column1':['1', '3', '2', '6', '4'],
'key_column2':['1', '5', '2', '2', '2'],
'fourth_column':['1', '2', '2', '2', '2'],
'other2':['1', '4', '3', '2', '2'],
'other3':['6', '8', '1', '4', '2'],
}
df2 = pd.DataFrame(data2)
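If you do an outer merge on the 2 key columns and bring along one extra column that exists only in the second dataframe, that column will contain NaN wherever a row is in the first dataframe but not in the second. For example:
df2_outer = df2.merge(df1[['key_column1', 'key_column2', 'first_column']], on=['key_column1', 'key_column2'], how='outer')
print(df2_outer)
gives: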
first second_column key_column1 ... other2 other3 first_column
0 1 1 1 ... 1 6 4
1 2 2 3 ... 4 8 NaN
2 2 2 2 ... 3 1 7
3 2 2 6 ... 2 4 NaN
4 2 2 4 ... 2 2 NaN
5 NaN NaN 3 ... NaN NaN 2
6 NaN NaN 6 ... NaN NaN 2
7 NaN NaN 4 ... NaN NaN 2
Here the NaNs in 'first_column' correspond to the rows that are in df2 but not in df1. You can use this with .loc[] to keep only those NaN rows, and only the columns of df2, like this:
df2_outer.loc[df2_outer['first_column'].isna(), df2.columns]
Output:
first second_column key_column1 key_column2 fourth_column other2 other3
1 2 2 3 5 2 4 8
3 2 2 6 2 2 2 4
4 2 2 4 2 2 2 2
The full code for both tables is:
df2_outer = df2.merge(df1[['key_column1', 'key_column2', 'first_column']], on=['key_column1', 'key_column2'], how='outer')
print('missing values of df1 compare df2')
df2_output = df2_outer.loc[df2_outer['first_column'].isna(), df2.columns]
print(df2_output)
df1_outer = df1.merge(df2[['key_column1', 'key_column2', 'first']], on=['key_column1', 'key_column2'], how='outer')
print('missing values of df2 compare df1')
df1_output = df1_outer.loc[df1_outer['first'].isna(), df1.columns]
print(df1_output)
Which outputs:
missing values of df1 compare df2
first second_column key_column1 key_column2 fourth_column other2 other3
1 2 2 3 5 2 4 8
3 2 2 6 2 2 2 4
4 2 2 4 2 2 2 2
missing values of df2 compare df1
first_column second_column key_column1 key_column2 fourth_column other
1 2 2 3 2 2 2
3 2 2 6 1 2 2
4 2 2 4 1 2 2
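The same kind of anti-join can also be written without choosing a marker column by hand, using the `indicator=True` option of `merge`. The following is only a sketch under a few assumptions: the helper name `anti_join` is made up for illustration, the frames are cut down to the key columns plus one value column, and neither frame already has a column called `_merge`.

```python
import pandas as pd

def anti_join(left, right, keys):
    """Return the rows of `left` whose key combination does not occur in `right`.

    Hypothetical helper, not part of pandas.
    """
    # Only the key columns of `right` matter; dropping duplicates keeps the
    # left merge at exactly one output row per row of `left`.
    right_keys = right[keys].drop_duplicates()
    merged = left.merge(right_keys, on=keys, how='left', indicator=True)
    # With indicator=True, pandas adds a '_merge' column that is 'left_only'
    # for rows without a matching key in `right`.
    mask = (merged['_merge'] == 'left_only').to_numpy()
    # The left merge preserves the row order of `left`, so the mask lines up
    # positionally and the original index of `left` is kept.
    return left.loc[mask]

keys = ['key_column1', 'key_column2']
df1 = pd.DataFrame({'first_column': ['4', '2', '7', '2', '2'],
                    'key_column1': ['1', '3', '2', '6', '4'],
                    'key_column2': ['1', '2', '2', '1', '1']})
df2 = pd.DataFrame({'first': ['1', '2', '2', '2', '2'],
                    'key_column1': ['1', '3', '2', '6', '4'],
                    'key_column2': ['1', '5', '2', '2', '2']})

only_in_df1 = anti_join(df1, df2, keys)   # rows of df1 with no key match in df2
only_in_df2 = anti_join(df2, df1, keys)   # rows of df2 with no key match in df1
print(only_in_df1)
print(only_in_df2)
```

Because a left merge is enough to tell which rows of the left frame are unmatched, this variant never produces the extra right-only rows that the outer merge creates and then filters away.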
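I have modified the data1 and data2 dictionaries so that the resulting dataframes have exactly the same columns. This demonstrates that the solution in Emi OB's answer relies on one dataframe having a column that the other lacks (if a column common to both is chosen to collect the NaNs, the code fails with a KeyError on that column). The improved version below does not have this limitation, because it creates its own columns for collecting the NaNs: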
df1['df1_NaNs'] = '' # create additional column to collect NaNs
df2['df2_NaNs'] = '' # create additional column to collect NaNs
df1_s = df1.merge(df2[['key_column1', 'key_column2', 'df2_NaNs']], on=['key_column1', 'key_column2'], how='outer')
df2 = df2.drop(columns=["df2_NaNs"]) # clean up df2
df1_s = df1_s.loc[df1_s['df2_NaNs'].isna(), df1.columns]
df1_s = df1_s.drop(columns=["df1_NaNs"]) # clean up df1_s
print(df1_s)
print('--------------------------------------------')
df2_s = df2.merge(df1[['key_column1', 'key_column2', 'df1_NaNs']], on=['key_column1', 'key_column2'], how='outer')
df1 = df1.drop(columns=["df1_NaNs"]) # clean up df1
df2_s = df2_s.loc[df2_s['df1_NaNs'].isna(), df2.columns]
# no cleanup needed for df2_s: 'df2_NaNs' was dropped from df2 before the merge, so dropping it here would raise a KeyError
print(df2_s)
gives:
first second_column key_column1 key_column2 fourth_column
1 2 2 3 2 2
3 2 2 6 1 2
4 2 2 4 1 2
--------------------------------------------
first second_column key_column1 key_column2 fourth_column
1 2 2 3 5 3
3 2 2 6 2 5
4 2 2 4 2 6
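The KeyError mentioned above comes from how `merge` treats non-key columns that have the same name in both frames: it renames them with the default suffixes `_x` and `_y`, so the original name no longer exists in the result. A small illustration follows; the frames `left` and `right` and their contents are invented for this example.

```python
import pandas as pd

# Two tiny frames that share the non-key column name 'first'.
left = pd.DataFrame({'key': ['1', '2'], 'first': ['a', 'b']})
right = pd.DataFrame({'key': ['1', '3'], 'first': ['c', 'd']})

merged = left.merge(right, on='key', how='outer')

# The shared non-key column is split into 'first_x' and 'first_y'.
print(list(merged.columns))  # ['key', 'first_x', 'first_y']

try:
    merged['first']
except KeyError as exc:
    print('KeyError:', exc)
```

Adding a dedicated marker column with a name that is unique to one frame, as in the code above, avoids this renaming.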
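If both dataframes have the same columns, the code below also works. In addition, it saves memory and computation time, because it does not create the temporary full-size dataframes that the merge-based approach needs to reach the final result: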
""" I want to compare two dataframes that have similar columns (not all)
and print a new dataframe that shows the missing rows of df1 compared to
df2 and a second dataframe that shows this time the missing values of
df2 compared to df1 based on given columns. Here the "key_columns"
are named key_column1 and key_column2.
"""
import pandas as pd
#data1 ={ 'first_column':['4', '2', '7', '2', '2'],
data1 = { 'first':['4', '2', '7', '2', '2'],
'second_column':['1', '2', '2', '2', '2'],
'key_column1':['1', '3', '2', '6', '4'],
'key_column2':['1', '2', '2', '1', '1'],
'fourth_column':['1', '2', '2', '2', '2'],
# 'other':['1', '2', '3', '2', '2'],
}
df1 = pd.DataFrame(data1)
#print(df1)
data2 = { 'first':['1', '2', '2', '2', '2'],
'second_column':['1', '2', '2', '2', '2'],
'key_column1':['1', '3', '2', '6', '4'],
'key_column2':['1', '5', '2', '2', '2'],
# 'fourth_column':['1', '2', '2', '2', '2'],
'fourth_column':['2', '3', '4', '5', '6'],
# 'other2':['1', '4', '3', '2', '2'],
# 'other3':['6', '8', '1', '4', '2'],
}
df2 = pd.DataFrame(data2)
#print(df2)
data1_key_cols = dict.fromkeys( zip(data1['key_column1'], data1['key_column2']) )
data2_key_cols = dict.fromkeys( zip(data2['key_column1'], data2['key_column2']) )
# for Python versions < 3.7 (dictionaries are not ordered):
#data1_key_cols = list(zip(data1['key_column1'], data1['key_column2']))
#data2_key_cols = list(zip(data2['key_column1'], data2['key_column2']))
from collections import defaultdict
missing_data2_in_data1 = defaultdict(list)
missing_data1_in_data2 = defaultdict(list)
# Note: dict.fromkeys() drops duplicate key pairs, so indx only matches the
# row position as long as the key pairs within each dataset are unique.
for indx, val in enumerate(data1_key_cols.keys()):
#for indx, val in enumerate(data1_key_cols): # for Python version < 3.7
    if val not in data2_key_cols:
        # copy every column value of this row into the result
        for key in data1:
            missing_data1_in_data2[key].append(data1[key][indx])
for indx, val in enumerate(data2_key_cols.keys()):
#for indx, val in enumerate(data2_key_cols): # for Python version < 3.7
    if val not in data1_key_cols:
        for key in data2:
            missing_data2_in_data1[key].append(data2[key][indx])
df1_s = pd.DataFrame(missing_data1_in_data2)
df2_s = pd.DataFrame(missing_data2_in_data1)
print(df1_s)
print('--------------------------------------------')
print(df2_s)
which prints:
first second_column key_column1 key_column2 fourth_column
0 2 2 3 2 2
1 2 2 6 1 2
2 2 2 4 1 2
--------------------------------------------
first second_column key_column1 key_column2 fourth_column
0 2 2 3 5 3
1 2 2 6 2 5
2 2 2 4 2 6
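A related sketch keeps the idea of comparing key tuples in plain Python but works directly on the dataframes, so it does not need the original dictionaries and still behaves correctly when a key pair occurs more than once. The function name `rows_missing_from` and the sample frames are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

def rows_missing_from(left, right, keys):
    """Return the rows of `left` whose key tuple does not appear in `right`.

    Hypothetical helper, not part of pandas.
    """
    # Collect the key tuples of `right` in a set for fast membership tests.
    right_keys = set(zip(*(right[k] for k in keys)))
    # One boolean per row of `left`, in row order (duplicates are kept).
    mask = np.array(
        [pair not in right_keys for pair in zip(*(left[k] for k in keys))],
        dtype=bool,
    )
    return left.loc[mask]

keys = ['key_column1', 'key_column2']
left = pd.DataFrame({'first': ['4', '2', '7', '2'],
                     'key_column1': ['1', '3', '3', '2'],
                     'key_column2': ['1', '2', '2', '2']})
right = pd.DataFrame({'first': ['1', '2'],
                      'key_column1': ['1', '2'],
                      'key_column2': ['1', '2']})

missing = rows_missing_from(left, right, keys)
print(missing)
```

Unlike the dictionary-based version, the result keeps the original row index of `left`, and no merged intermediate frame is built.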