pandas compare two dataframes and their columns to find difference by reference column

I'm trying to compare two dataframes in pandas based on a ref column and find the differences.

The dataframes look like this:

"Dataframe 1":
ref        key1        key2        key3        key4        key5
001        vk11        vk12        vk13        vk14        vk15
002        vk21        vk22        vk23        vk24        vk25
003        vk31        vk32        vk33        vk34        vk35
004        vk41        vk42        vk43        vk44        vk45
005        vk51        vk52        vk53        vk54        vk55
006        vk61        vk62        vk63        vk64        vk65


"Dataframe 2":
ref        key1        key2        key3        key4        key5
001        vk11        vk12        vk13        vk14        vk15
002        vk21        vkkk        vk23        vk24        vk25
003        vk31        vk32        vk33        vkkk        vk35
005        vkkk        vkkk        vkkk        vk54        vk55
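
To reproduce the example, the two frames can be built by hand. This is a minimal sketch: the `ref` values are stored as strings here to keep the leading zeros, which is an assumption about the original data (if `ref` is read as an integer column, it prints as 1, 2, 3 instead).

```python
import pandas as pd

columns = ["ref", "key1", "key2", "key3", "key4", "key5"]

# Dataframe 1: six references.
df1 = pd.DataFrame(
    [
        ["001", "vk11", "vk12", "vk13", "vk14", "vk15"],
        ["002", "vk21", "vk22", "vk23", "vk24", "vk25"],
        ["003", "vk31", "vk32", "vk33", "vk34", "vk35"],
        ["004", "vk41", "vk42", "vk43", "vk44", "vk45"],
        ["005", "vk51", "vk52", "vk53", "vk54", "vk55"],
        ["006", "vk61", "vk62", "vk63", "vk64", "vk65"],
    ],
    columns=columns,
)

# Dataframe 2: four references, some values replaced by "vkkk".
df2 = pd.DataFrame(
    [
        ["001", "vk11", "vk12", "vk13", "vk14", "vk15"],
        ["002", "vk21", "vkkk", "vk23", "vk24", "vk25"],
        ["003", "vk31", "vk32", "vk33", "vkkk", "vk35"],
        ["005", "vkkk", "vkkk", "vkkk", "vk54", "vk55"],
    ],
    columns=columns,
)

# References that exist only in Dataframe 1.
only_in_df1 = sorted(set(df1["ref"]) - set(df2["ref"]))
print(only_in_df1)
```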

The final result set should look like below.

  1. Remove references that don't exist in Dataframe 2
  2. Remove the rows that match exactly
  3. The final output should be as below

"Final Dataframe":

key   key1  key1  key1  key2  key2  key2  key3  key3  key3  key4  key4  key4  key5  key5  key5
Hdr   DF-1  DF-2  VALC  DF-1  DF-2  VALC  DF-1  DF-2  VALC  DF-1  DF-2  VALC  DF-1  DF-2  VALC
002   vk21  vk21  N     vk22  vkkk  Y     vk23  vk23  N     vk24  vk24  N     vk25  vk25  N
003   vk31  vk31  N     vk32  vk32  N     vk33  vk33  N     vk34  vkkk  Y     vk35  vk35  N
005   vk51  vkkk  Y     vk52  vkkk  Y     vk53  vkkk  Y     vk54  vk54  N     vk55  vk55  N

Note: VALC = value changed; DF-1 = Dataframe 1; DF-2 = Dataframe 2.

First use concat with an inner join and the keys parameter:

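import numpy as np
import pandas as pd

# Align on 'ref'; the inner join keeps only refs present in both frames.
df = pd.concat([df1.set_index('ref'), df2.set_index('ref')],
               axis=1,
               join='inner',
               keys=('df1', 'df2'))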
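
The effect of `join='inner'` and `keys` can be checked on a small case. The sketch below uses two made-up frames (`a` and `b` are hypothetical, not the question's data) and only inspects the resulting index and column labels.

```python
import pandas as pd

# Hypothetical toy frames: ref 2 is missing from b.
a = pd.DataFrame({"ref": [1, 2, 3], "key1": ["x", "y", "z"]})
b = pd.DataFrame({"ref": [1, 3], "key1": ["x", "w"]})

wide = pd.concat(
    [a.set_index("ref"), b.set_index("ref")],
    axis=1,
    join="inner",
    keys=("df1", "df2"),
)

# Only refs present in both frames survive the inner join.
print(wide.index.tolist())    # [1, 3]
# keys= adds an outer column level naming the source frame.
print(wide.columns.tolist())  # [('df1', 'key1'), ('df2', 'key1')]
```

Because the source frame becomes the outer column level, each side can later be pulled out again by that label.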

Then select each frame's values with DataFrame.xs and compare them, and build a new DataFrame of Y/N flags with numpy.where and MultiIndex.from_product:

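# True where the value is identical in both frames.
mask = df.xs('df1', axis=1).eq(df.xs('df2', axis=1))

# Note: this rebinds df1 to the Y/N flag frame ('Y' = value changed).
df1 = pd.DataFrame(np.where(mask, 'N', 'Y'),
                   index=mask.index,
                   columns=pd.MultiIndex.from_product([['valc'], mask.columns]))
print(df1)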
    valc                    
    key1 key2 key3 key4 key5
ref                         
1      N    N    N    N    N
2      N    Y    N    N    N
3      N    N    N    Y    N
5      Y    Y    Y    N    N

Join the two together and sort the columns:

df = pd.concat([df, df1], axis=1).sort_index(axis=1, level=[1,0])

Remove the rows where every value is equal, using DataFrame.all and the mask inverted with ~:

df = df[~mask.all(axis=1)]

print(df)
      df1   df2 valc   df1   df2 valc   df1   df2 valc   df1   df2 valc   df1  \
     key1  key1 key1  key2  key2 key2  key3  key3 key3  key4  key4 key4  key5   
ref                                                                             
2    vk21  vk21    N  vk22  vkkk    Y  vk23  vk23    N  vk24  vk24    N  vk25   
3    vk31  vk31    N  vk32  vk32    N  vk33  vk33    N  vk34  vkkk    Y  vk35   
5    vk51  vkkk    Y  vk52  vkkk    Y  vk53  vkkk    Y  vk54  vk54    N  vk55   

      df2 valc  
     key5 key5  
ref             
2    vk25    N  
3    vk35    N  
5    vk55    N  
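
The steps above can be gathered into one helper. This is a sketch under stated assumptions, not part of the original answer: the function name `diff_by_ref` and the toy frames are hypothetical, and both inputs are assumed to share the same value columns and to have unique `ref` values. Note that `eq` treats NaN as unequal to NaN, so missing values on both sides would be flagged as changed.

```python
import numpy as np
import pandas as pd


def diff_by_ref(left, right, ref="ref"):
    """Return refs present in both frames whose rows differ, with Y/N flags."""
    # Inner join on ref; outer column level records the source frame.
    wide = pd.concat(
        [left.set_index(ref), right.set_index(ref)],
        axis=1,
        join="inner",
        keys=("df1", "df2"),
    )
    # True where both frames hold the same value.
    same = wide.xs("df1", axis=1).eq(wide.xs("df2", axis=1))
    # 'Y' marks a changed value, 'N' an unchanged one.
    flags = pd.DataFrame(
        np.where(same, "N", "Y"),
        index=same.index,
        columns=pd.MultiIndex.from_product([["valc"], same.columns]),
    )
    # Group columns by key, then by df1 / df2 / valc.
    out = pd.concat([wide, flags], axis=1).sort_index(axis=1, level=[1, 0])
    # Drop rows that are identical in both frames.
    return out[~same.all(axis=1)]


# Hypothetical toy data: ref 3 is missing on the right, ref 1 is unchanged.
left = pd.DataFrame(
    {"ref": [1, 2, 3], "key1": ["a", "b", "c"], "key2": ["d", "e", "f"]}
)
right = pd.DataFrame({"ref": [1, 2], "key1": ["a", "x"], "key2": ["d", "e"]})

result = diff_by_ref(left, right)
print(result)
```

Only ref 2 remains in `result`, with `valc` set to Y for key1 and N for key2.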
