
pandas compare two dataframes and their columns to find difference by reference column

I'm trying to compare two pandas DataFrames on a reference column (ref) and report the differences column by column.

The DataFrames look like this:

"Dataframe 1":
ref        key1        key2        key3        key4        key5
001        vk11        vk12        vk13        vk14        vk15
002        vk21        vk22        vk23        vk24        vk25
003        vk31        vk32        vk33        vk34        vk35
004        vk41        vk42        vk43        vk44        vk45
005        vk51        vk52        vk53        vk54        vk55
006        vk61        vk62        vk63        vk64        vk65


"Dataframe 2":
ref        key1        key2        key3        key4        key5
001        vk11        vk12        vk13        vk14        vk15
002        vk21        vkkk        vk23        vk24        vk25
003        vk31        vk32        vk33        vkkk        vk35
005        vkkk        vkkk        vkkk        vk54        vk55

The final result should be built as follows:

  1. Remove references that don't exist in Dataframe 2
  2. Remove rows that match exactly in both dataframes
  3. The final output should look like the table below

"Final Dataframe":

key   key1  key1  key1  key2  key2  key2  key3  key3  key3  key4  key4  key4  key5  key5  key5
Hdr   DF-1  DF-2  VALC  DF-1  DF-2  VALC  DF-1  DF-2  VALC  DF-1  DF-2  VALC  DF-1  DF-2  VALC
002   vk21  vk21  N     vk22  vkkk  Y     vk23  vk23  N     vk24  vk24  N     vk25  vk25  N
003   vk31  vk31  N     vk32  vk32  N     vk33  vk33  N     vk34  vkkk  Y     vk35  vk35  N
005   vk51  vkkk  Y     vk52  vkkk  Y     vk53  vkkk  Y     vk54  vk54  N     vk55  vk55  N

Legend: VALC - value changed (Y/N); DF-1 - Dataframe 1; DF-2 - Dataframe 2.

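First, use concat with an inner join and the keys parameter. Setting ref as the index aligns the rows on it, join='inner' keeps only the references present in both DataFrames, and keys labels the columns coming from each source: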

import numpy as np
import pandas as pd

df = pd.concat([df1.set_index('ref'), df2.set_index('ref')], 
               axis=1, 
               join='inner',
               keys=('df1','df2'))
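
To make the shape of that intermediate result concrete, here is a minimal sketch on tiny stand-in frames. The names left, right and wide and the values in them are hypothetical, chosen only for illustration; the concat call itself mirrors the one above.

```python
import pandas as pd

# Tiny stand-in frames (hypothetical values, two key columns only).
left = pd.DataFrame({"ref": [1, 2, 3],
                     "key1": ["a", "b", "c"],
                     "key2": ["x", "y", "z"]})
right = pd.DataFrame({"ref": [1, 3],
                      "key1": ["a", "c"],
                      "key2": ["x", "CHANGED"]})

wide = pd.concat(
    [left.set_index("ref"), right.set_index("ref")],
    axis=1,          # place the two frames side by side
    join="inner",    # keep only refs present in both frames
    keys=("df1", "df2"),  # outer column level naming the source frame
)

print(wide.index.tolist())    # refs that survive the inner join
print(wide.columns.tolist())  # (source, key) column pairs
```

Here ref 2 is dropped because it is missing from right, and every column is addressed by a (source, key) pair, which is what allows the two halves to be pulled apart again later.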

Then select each source's columns with DataFrame.xs, compare them element-wise (the mask is True where the values are equal), and build a new DataFrame of Y/N flags with numpy.where and MultiIndex.from_product:

mask = df.xs('df1', axis=1).eq(df.xs('df2', axis=1))

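# mask is True where the values are equal, so equal -> 'N', changed -> 'Y'
# note: this rebinds df1 from the original input to the flags DataFrame
df1 = pd.DataFrame(np.where(mask, 'N','Y'), 
                  index=mask.index,
                  columns=pd.MultiIndex.from_product([['valc'], mask.columns]))
print(df1)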
    valc                    
    key1 key2 key3 key4 key5
ref                         
1      N    N    N    N    N
2      N    Y    N    N    N
3      N    N    N    Y    N
5      Y    Y    Y    N    N
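
The same comparison step can be tried in isolation. The sketch below assumes a small hand-built frame with the two-level column layout described above; wide, same and flags are hypothetical names, and the values are made up.

```python
import numpy as np
import pandas as pd

# Hand-built frame with (source, key) columns; values are illustrative only.
wide = pd.DataFrame(
    [["a", "x", "a", "x"],
     ["c", "z", "c", "CHANGED"]],
    index=pd.Index([1, 3], name="ref"),
    columns=pd.MultiIndex.from_product([["df1", "df2"], ["key1", "key2"]]),
)

# xs drops the outer level, leaving plain key1/key2 columns for each source.
same = wide.xs("df1", axis=1).eq(wide.xs("df2", axis=1))

# Equal values become 'N' (not changed), different values become 'Y'.
flags = pd.DataFrame(
    np.where(same, "N", "Y"),
    index=same.index,
    columns=pd.MultiIndex.from_product([["valc"], same.columns]),
)
print(flags)
```

One caveat worth keeping in mind: eq treats NaN as unequal to NaN, so a value that is missing in both frames would be flagged 'Y' unless missing values are filled first.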

Join the flags to the compared values and sort the columns so that each key's df1, df2 and valc columns sit side by side:

df = pd.concat([df, df1], axis=1).sort_index(axis=1, level=[1,0])
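
The effect of sorting on level=[1,0] is easier to see on a one-row frame. This is only an illustrative sketch; frame and interleaved are hypothetical names and the cell values are placeholders.

```python
import pandas as pd

# Columns grouped by source first: all df1, then all df2, then all valc.
cols = pd.MultiIndex.from_product([["df1", "df2", "valc"], ["key1", "key2"]])
frame = pd.DataFrame([["a", "b", "c", "d", "N", "Y"]], columns=cols)

# Sort by the key level first (level 1), then by source (level 0).
interleaved = frame.sort_index(axis=1, level=[1, 0])
print(interleaved.columns.tolist())
```

The sort is lexicographic, so the df1, df2, valc order within each key comes from those labels happening to sort alphabetically; different labels might need an explicit column reindex instead.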

Remove the rows where every value is equal, using DataFrame.all and the mask inverted with ~:

df = df[~mask.all(axis=1)]

print(df)
      df1   df2 valc   df1   df2 valc   df1   df2 valc   df1   df2 valc   df1  \
     key1  key1 key1  key2  key2 key2  key3  key3 key3  key4  key4 key4  key5   
ref                                                                             
2    vk21  vk21    N  vk22  vkkk    Y  vk23  vk23    N  vk24  vk24    N  vk25   
3    vk31  vk31    N  vk32  vk32    N  vk33  vk33    N  vk34  vkkk    Y  vk35   
5    vk51  vkkk    Y  vk52  vkkk    Y  vk53  vkkk    Y  vk54  vk54    N  vk55   

      df2 valc  
     key5 key5  
ref             
2    vk25    N  
3    vk35    N  
5    vk55    N  
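
Putting the steps together, the whole procedure can be sketched as one function. diff_by_ref is a hypothetical helper name, not a pandas function. The sample data is rebuilt programmatically from the question's tables, with ref as integers as in the printed output, and the sketch assumes both frames have the same key columns and unique ref values.

```python
import numpy as np
import pandas as pd


def diff_by_ref(left, right, ref="ref"):
    """Hypothetical helper: side-by-side diff of rows sharing a ref value."""
    wide = pd.concat(
        [left.set_index(ref), right.set_index(ref)],
        axis=1,
        join="inner",          # drop refs missing from either frame
        keys=("df1", "df2"),
    )
    same = wide.xs("df1", axis=1).eq(wide.xs("df2", axis=1))
    flags = pd.DataFrame(
        np.where(same, "N", "Y"),
        index=same.index,
        columns=pd.MultiIndex.from_product([["valc"], same.columns]),
    )
    out = pd.concat([wide, flags], axis=1).sort_index(axis=1, level=[1, 0])
    return out[~same.all(axis=1)]  # keep rows with at least one change


# Rebuild the question's sample data: vk<ref><key> in every cell of df1.
refs = [1, 2, 3, 4, 5, 6]
df1 = pd.DataFrame(
    {"ref": refs,
     **{f"key{j}": [f"vk{i}{j}" for i in refs] for j in range(1, 6)}}
)
# df2 lacks refs 4 and 6 and has a few changed cells.
df2 = df1[df1["ref"].isin([1, 2, 3, 5])].copy()
df2.loc[df2["ref"] == 2, "key2"] = "vkkk"
df2.loc[df2["ref"] == 3, "key4"] = "vkkk"
df2.loc[df2["ref"] == 5, ["key1", "key2", "key3"]] = "vkkk"

result = diff_by_ref(df1, df2)
print(result)
```

Wrapping the steps in a function also avoids rebinding df1 along the way, so the original inputs stay available after the comparison.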

