pandas: compare two dataframes and their columns to find differences by a reference column
I'm trying to compare two dataframes in pandas based on a ref column and find the differences.
The dataframes look like this:
"Dataframe 1":
ref key1 key2 key3 key4 key5
001 vk11 vk12 vk13 vk14 vk15
002 vk21 vk22 vk23 vk24 vk25
003 vk31 vk32 vk33 vk34 vk35
004 vk41 vk42 vk43 vk44 vk45
005 vk51 vk52 vk53 vk54 vk55
006 vk61 vk62 vk63 vk64 vk65
"Dataframe 2":
ref key1 key2 key3 key4 key5
001 vk11 vk12 vk13 vk14 vk15
002 vk21 vkkk vk23 vk24 vk25
003 vk31 vk32 vk33 vkkk vk35
005 vkkk vkkk vkkk vk54 vk55
The final result set should look like this:
"Final Dataframe":
key  key1 key1 key1  key2 key2 key2  key3 key3 key3  key4 key4 key4  key5 key5 key5
Hdr  DF-1 DF-2 VALC  DF-1 DF-2 VALC  DF-1 DF-2 VALC  DF-1 DF-2 VALC  DF-1 DF-2 VALC
002  vk21 vk21 N     vk22 vkkk Y     vk23 vk23 N     vk24 vk24 N     vk25 vk25 N
003  vk31 vk31 N     vk32 vk32 N     vk33 vk33 N     vk34 vkkk Y     vk35 vk35 N
005  vk51 vkkk Y     vk52 vkkk Y     vk53 vkkk Y     vk54 vk54 N     vk55 vk55 N
PS: VALC - value changed; DF1 - Dataframe 1; DF2 - Dataframe 2.
Use concat with an inner join and the keys parameter first:
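import numpy as np
import pandas as pd

# Align both frames on ref; the inner join keeps only refs present in both
df = pd.concat([df1.set_index('ref'), df2.set_index('ref')],
               axis=1,
               join='inner',
               keys=('df1', 'df2'))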
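To try this step on the sample data, here is a minimal sketch. It rebuilds the two frames from the question, assuming ref is stored as integers (which matches the printed output in this answer), and it uses the hypothetical name wide for the result instead of df.

```python
import pandas as pd

keys = [f"key{i}" for i in range(1, 6)]

# Dataframe 1: refs 1..6, each cell is "vk<ref><column number>"
df1 = pd.DataFrame(
    [[r] + [f"vk{r}{c}" for c in range(1, 6)] for r in range(1, 7)],
    columns=["ref"] + keys,
)

# Dataframe 2: refs 1, 2, 3, 5 with some cells replaced by "vkkk"
df2 = df1[df1["ref"].isin([1, 2, 3, 5])].copy()
df2.loc[df2["ref"] == 2, "key2"] = "vkkk"
df2.loc[df2["ref"] == 3, "key4"] = "vkkk"
df2.loc[df2["ref"] == 5, ["key1", "key2", "key3"]] = "vkkk"

# Side-by-side frame with a two-level column header: (source, key)
wide = pd.concat(
    [df1.set_index("ref"), df2.set_index("ref")],
    axis=1,
    join="inner",
    keys=("df1", "df2"),
)

print(wide.shape)            # 4 shared refs, 2 * 5 columns
print(wide.columns.nlevels)  # 2
```

Refs 4 and 6 exist only in Dataframe 1, so the inner join drops them before any comparison happens.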
Then compare the selected values with DataFrame.xs and create a new DataFrame with numpy.where and MultiIndex.from_product:
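# True where df1 and df2 hold the same value
mask = df.xs('df1', axis=1).eq(df.xs('df2', axis=1))
# 'N' = unchanged, 'Y' = changed; note that this reuses the name df1
df1 = pd.DataFrame(np.where(mask, 'N', 'Y'),
                   index=mask.index,
                   columns=pd.MultiIndex.from_product([['valc'], mask.columns]))
print(df1)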
    valc
    key1 key2 key3 key4 key5
ref
1      N    N    N    N    N
2      N    Y    N    N    N
3      N    N    N    Y    N
5      Y    Y    Y    N    N
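The comparison step can be checked in isolation on a tiny frame. The sketch below is only an illustration: the two-row data and the names wide, left, right and flags are made up for the example. It shows that xs drops the outer column level, so both sides end up with identical column labels and eq can compare them cell by cell.

```python
import numpy as np
import pandas as pd

# Tiny two-level frame: outer level = source, inner level = key
columns = pd.MultiIndex.from_product([["df1", "df2"], ["key1", "key2"]])
wide = pd.DataFrame(
    [["a", "b", "a", "b"],
     ["c", "d", "c", "x"]],
    index=pd.Index([1, 2], name="ref"),
    columns=columns,
)

# xs removes the selected outer level, leaving only key1/key2 as labels
left = wide.xs("df1", axis=1)
right = wide.xs("df2", axis=1)
mask = left.eq(right)  # True where the values match

# Translate the boolean mask into N/Y flags under a new outer level "valc"
flags = pd.DataFrame(
    np.where(mask, "N", "Y"),
    index=mask.index,
    columns=pd.MultiIndex.from_product([["valc"], mask.columns]),
)
print(flags)
```

Only the key2 cell of ref 2 differs here, so it is the only one flagged Y.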
Join them together and sort the columns:
df = pd.concat([df, df1], axis=1).sort_index(axis=1, level=[1,0])
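The level=[1,0] argument is what groups the df1, df2 and valc columns of each key next to each other: columns are sorted by the key name first and by the source label second. A small sketch on made-up single-row data (the names frame and ordered are hypothetical):

```python
import pandas as pd

columns = pd.MultiIndex.from_tuples([
    ("df1", "key1"), ("df1", "key2"),
    ("df2", "key1"), ("df2", "key2"),
    ("valc", "key1"), ("valc", "key2"),
])
frame = pd.DataFrame([list("abcdef")], columns=columns)

# Sort by the inner level (key) first, then by the outer level (source)
ordered = frame.sort_index(axis=1, level=[1, 0])
print(ordered.columns.tolist())
```

After sorting, all key1 columns come first, followed by all key2 columns, each in df1, df2, valc order.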
Remove the rows where all values are equal by using DataFrame.all and inverting the mask with ~:
df = df[~mask.all(axis=1)]
print (df)
      df1   df2 valc   df1   df2 valc   df1   df2 valc   df1   df2 valc   df1  \
     key1  key1 key1  key2  key2 key2  key3  key3 key3  key4  key4 key4  key5
ref
2    vk21  vk21    N  vk22  vkkk    Y  vk23  vk23    N  vk24  vk24    N  vk25
3    vk31  vk31    N  vk32  vk32    N  vk33  vk33    N  vk34  vkkk    Y  vk35
5    vk51  vkkk    Y  vk52  vkkk    Y  vk53  vkkk    Y  vk54  vk54    N  vk55

      df2 valc
     key5 key5
ref
2    vk25    N
3    vk35    N
5    vk55    N
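For reuse, the steps can be wrapped in one function. This is a sketch under a few assumptions rather than part of the original answer: the function name diff_by_ref is hypothetical, both frames are assumed to have the same non-ref columns, and the final swaplevel call is an optional extra that moves the key names to the top header row, closer to the layout requested in the question.

```python
import numpy as np
import pandas as pd

def diff_by_ref(left, right, ref="ref"):
    """Rows whose ref is in both frames and where at least one value differs."""
    wide = pd.concat(
        [left.set_index(ref), right.set_index(ref)],
        axis=1, join="inner", keys=("df1", "df2"),
    )
    mask = wide.xs("df1", axis=1).eq(wide.xs("df2", axis=1))
    flags = pd.DataFrame(
        np.where(mask, "N", "Y"),
        index=mask.index,
        columns=pd.MultiIndex.from_product([["valc"], mask.columns]),
    )
    out = pd.concat([wide, flags], axis=1).sort_index(axis=1, level=[1, 0])
    out = out[~mask.all(axis=1)]  # keep rows with at least one difference
    # Optional: put key names on the top header row
    return out.swaplevel(0, 1, axis=1)

# Made-up sample data for illustration only
left = pd.DataFrame({"ref": [1, 2, 3], "key1": ["a", "b", "c"], "key2": ["p", "q", "r"]})
right = pd.DataFrame({"ref": [1, 2, 4], "key1": ["a", "B", "c"], "key2": ["p", "q", "r"]})

result = diff_by_ref(left, right)
print(result)
```

In this sample only ref 2 survives: ref 1 is identical in both frames, and refs 3 and 4 each appear in just one of them.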