简体   繁体   English

基于多列 python 比较两个数据帧的缺失行

[英]Compare two dataframes for missing rows based on multiple columns python

I want to compare two dataframes that have similar columns(not all) and print a new dataframe that shows the missing rows of df1 compare to df2 and a second dataframe that shows this time the missing values of df2 compare to df1 based on given columns.我想比较两个具有相似列(不是全部)的数据帧,并打印一个新的 dataframe 显示 df1 的缺失行与 df2 比较,第二个 dataframe 显示这次 df2 的缺失值与 df1 基于给定列的比较。

Here the "key_columns" are named key_column1 and key_column2这里的“key_columns”被命名为 key_column1 和 key_column2

import pandas as pd


data1 = {'first_column':  ['4', '2', '7', '2', '2'],
        'second_column': ['1', '2', '2', '2', '2'],
       'key_column1':['1', '3', '2', '6', '4'],
      'key_column2':['1', '2', '2', '1', '1'],
       'fourth_column':['1', '2', '2', '2', '2'],
         'other':['1', '2', '3', '2', '2'],
        }
df1 = pd.DataFrame(data1)

data2 = {'first':  ['1', '2', '2', '2', '2'],
        'second_column': ['1', '2', '2', '2', '2'],
       'key_column1':['1', '3', '2', '6', '4'],
      'key_column2':['1', '5', '2', '2', '2'],
       'fourth_column':['1', '2', '2', '2', '2'],
          'other2':['1', '4', '3', '2', '2'],
         'other3':['6', '8', '1', '4', '2'],
        }

df2 = pd.DataFrame(data2)

在此处输入图像描述

If you outer merge on the 2 key columns, with an additional unique column in the second dataframe, that unique column will show Nan where the row is in the first dataframe but not the second.如果您在 2 个键列上进行外部合并,并在第二个 dataframe 中添加一个额外的唯一列,则该唯一列将显示Nan ,该行位于第一个 dataframe 而不是第二个。 For example:例如:

df2.merge(df1[['key_column1', 'key_column2', 'first_column']], on=['key_column1', 'key_column2'], how='outer')

gives:给出:

  first second_column key_column1  ... other2 other3 first_column
0     1             1           1  ...      1      6            4
1     2             2           3  ...      4      8          NaN
2     2             2           2  ...      3      1            7
3     2             2           6  ...      2      4          NaN
4     2             2           4  ...      2      2          NaN
5   NaN           NaN           3  ...    NaN    NaN            2
6   NaN           NaN           6  ...    NaN    NaN            2
7   NaN           NaN           4  ...    NaN    NaN            2

Here the Nans in 'first_column' correspond to the rows in df2 that are not in df1.这里'first_column'中的Nans对应于df2中不在df1中的行。 You can then use this fact with .loc[] to filter on those Nan rows, and only the columns in df2 like so:然后,您可以将这个事实与.loc[]一起使用来过滤那些 Nan 行,并且只有 df2 中的列像这样:

df2_outer.loc[df2_outer['first_column'].isna(), df2.columns]

Output: Output:

  first second_column key_column1 key_column2 fourth_column other2 other3
1     2             2           3           5             2      4      8
3     2             2           6           2             2      2      4
4     2             2           4           2             2      2      2

Full code for both tables is:两个表的完整代码是:

df2_outer = df2.merge(df1[['key_column1', 'key_column2', 'first_column']], on=['key_column1', 'key_column2'], how='outer')
print('missing values of df1 compare df2')
df2_output = df2_outer.loc[df2_outer['first_column'].isna(), df2.columns]
print(df2_output)

df1_outer = df1.merge(df2[['key_column1', 'key_column2', 'first']], on=['key_column1', 'key_column2'], how='outer')
print('missing values of df2 compare df1')
df1_output = df1_outer.loc[df1_outer['first'].isna(), df1.columns]
print(df1_output)

Which outputs:哪个输出:

missing values of df1 compare df2
  first second_column key_column1 key_column2 fourth_column other2 other3
1     2             2           3           5             2      4      8
3     2             2           6           2             2      2      4
4     2             2           4           2             2      2      2
missing values of df2 compare df1
  first_column second_column key_column1 key_column2 fourth_column other
1            2             2           3           2             2     2
3            2             2           6           1             2     2
4            2             2           4           1             2     2

I have modified the data1 and data2 dictionaries so that the resulting dataframes have only same columns to demonstrate that the solution provided in the answer by Emi OB relies on existence of columns in one dataframe which are not in the other one ( in case a common column is used the code fails with KeyError on the column chosen to collect NaNs).我已经修改了 data1 和 data2 字典,以便生成的数据帧只有相同的列,以证明Emi OB在答案中提供的解决方案依赖于一个 dataframe 中的列的存在,而另一个 dataframe使用代码在选择收集 NaN 的列上出现 KeyError 失败)。 Below an improved version which does not suffer from that limitation creating own columns for the purpose of collecting NaNs:下面的改进版本不受该限制的影响,为收集 NaN 创建自己的列:

df1['df1_NaNs'] = '' # create additional column to collect NaNs
df2['df2_NaNs'] = '' # create additional column to collect NaNs

df1_s = df1.merge(df2[['key_column1', 'key_column2', 'df2_NaNs']], on=['key_column1', 'key_column2'], how='outer')
df2 = df2.drop(columns=["df2_NaNs"])   # clean up df2 
df1_s = df1_s.loc[df1_s['df2_NaNs'].isna(), df1.columns]
df1_s = df1_s.drop(columns=["df1_NaNs"]) # clean up df1_s 
print(df1_s)

print('--------------------------------------------')

df2_s = df2.merge(df1[['key_column1', 'key_column2', 'df1_NaNs']], on=['key_column1', 'key_column2'], how='outer')
df1 = df1.drop(columns=["df1_NaNs"])   # clean up df1 
df2_s = df2_s.loc[df2_s['df1_NaNs'].isna(), df2.columns]
df2_s = df2_s.drop(columns=["df2_NaNs"]) # clean up df2_s 
print(df2_s)

gives:给出:

  first second_column key_column1 key_column2 fourth_column
1     2             2           3           2             2
3     2             2           6           1             2
4     2             2           4           1             2
--------------------------------------------
  first second_column key_column1 key_column2 fourth_column
1     2             2           3           5             3
3     2             2           6           2             5
4     2             2           4           2             6

Also the code below works in case the columns of both dataframes are the same and in addition saves memory and computation time by not creating temporary full-sized dataframes required to achieve the final result:如果两个数据帧的列相同,下面的代码也可以工作,此外,通过不创建实现最终结果所需的临时全尺寸数据帧,节省了 memory 和计算时间:

""" I want to compare two dataframes that have similar columns(not all) 
and print a new dataframe that shows the missing rows of df1 compare to 
df2 and a second dataframe that shows this time the missing values of 
df2 compare to df1 based on given columns. Here the "key_columns"
"""
import pandas as pd
#data1 ={  'first_column':['4', '2', '7', '2', '2'],
data1 = {         'first':['4', '2', '7', '2', '2'],
          'second_column':['1', '2', '2', '2', '2'],
            'key_column1':['1', '3', '2', '6', '4'],
            'key_column2':['1', '2', '2', '1', '1'],
          'fourth_column':['1', '2', '2', '2', '2'],
#                 'other':['1', '2', '3', '2', '2'],
        }
df1 = pd.DataFrame(data1)
#print(df1)

data2 = {  'first':['1', '2', '2', '2', '2'],
   'second_column':['1', '2', '2', '2', '2'],
     'key_column1':['1', '3', '2', '6', '4'],
     'key_column2':['1', '5', '2', '2', '2'],
#  'fourth_column':['1', '2', '2', '2', '2'],
   'fourth_column':['2', '3', '4', '5', '6'],
#         'other2':['1', '4', '3', '2', '2'],
#         'other3':['6', '8', '1', '4', '2'],
        }
df2 = pd.DataFrame(data2)
#print(df2)

data1_key_cols = dict.fromkeys( zip(data1['key_column1'], data1['key_column2']) )
data2_key_cols = dict.fromkeys( zip(data2['key_column1'], data2['key_column2']) )
# for Python versions < 3.7 (dictionaries are not ordered):
#data1_key_cols = list(zip(data1['key_column1'], data1['key_column2']))
#data2_key_cols = list(zip(data2['key_column1'], data2['key_column2']))
from collections import defaultdict
missing_data2_in_data1 = defaultdict(list)
missing_data1_in_data2 = defaultdict(list)

for indx, val in enumerate(data1_key_cols.keys()):
#for indx, val in enumerate(data1_key_cols): # for Python version < 3.7
    if val not in data2_key_cols:
        for key, val in data1.items():
            missing_data1_in_data2[key].append(data1[key][indx])
for indx, val in enumerate(data2_key_cols.keys()):
#for indx, val in enumerate(data2_key_cols): # for Python version < 3.7
    if val not in data1_key_cols:
        for key, val in data2.items():
            missing_data2_in_data1[key].append(data2[key][indx])
df1_s = pd.DataFrame(missing_data1_in_data2)
df2_s = pd.DataFrame(missing_data2_in_data1)
print(df1_s)
print('--------------------------------------------')
print(df2_s)

prints印刷

  first second_column key_column1 key_column2 fourth_column
0     2             2           3           2             2
1     2             2           6           1             2
2     2             2           4           1             2
--------------------------------------------
  first second_column key_column1 key_column2 fourth_column
0     2             2           3           5             3
1     2             2           6           2             5
2     2             2           4           2             6

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM