Python: How to compare two data frames

I have two data frames:

df1

A1    B1
1     a
2     s
3     d

and

df2

A1    B1
1     a
2     x
3     d

I want to compare df1 and df2 on column B1. The column A1 can be used to join. I want to know:

  1. Which rows are different in df1 and df2 with respect to column B1?
  2. Is there a mismatch in the values of column A1? For example, is df2 missing some values that are in df1, or vice versa? If so, which ones?

I tried using merge and join, but that is not what I am looking for.

I've edited the raw data to illustrate the case of A1 keys in one dataframe but not the other.

When doing your merge, you want to specify an 'outer' merge so that you can see the items whose A1 key is in one dataframe but not the other.

I've included the suffixes '_1' and '_2' to indicate the dataframe source (_1 = df1 and _2 = df2) of column B1.

import pandas as pd

df1 = pd.DataFrame({'A1': [1, 2, 3, 4], 'B1': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'A1': [1, 2, 3, 5], 'B1': ['a', 'd', 'c', 'e']})

df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'])
df3['check'] = df3.B1_1 == df3.B1_2

>>> df3
   A1 B1_1 B1_2  check
0   1    a    a   True
1   2    b    d  False
2   3    c    c   True
3   4    d  NaN  False
4   5  NaN    e  False

To check for missing A1 keys in df1 and df2:

# A1 value missing in `df1`
>>> df3[df3.B1_1.isnull()]
   A1 B1_1 B1_2  check
4   5  NaN    e  False

# A1 value missing in `df2`
>>> df3[df3.B1_2.isnull()]
   A1 B1_1 B1_2  check
3   4    d  NaN  False
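
The isnull() checks above rely on B1 never being missing in the original data. If that might not hold, comparing the A1 keys directly is one alternative. This is a minimal sketch using the same example frames; the names missing_in_df2 and missing_in_df1 are illustrative only:

```python
import pandas as pd

df1 = pd.DataFrame({'A1': [1, 2, 3, 4], 'B1': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'A1': [1, 2, 3, 5], 'B1': ['a', 'd', 'c', 'e']})

# Rows of df1 whose A1 key does not appear in df2
missing_in_df2 = df1[~df1['A1'].isin(df2['A1'])]

# Rows of df2 whose A1 key does not appear in df1
missing_in_df1 = df2[~df2['A1'].isin(df1['A1'])]
```

This looks only at the keys, so it gives the same answer regardless of what B1 contains.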

EDIT: Thanks to @EdChum (the source of all Pandas knowledge...). Passing indicator=True to merge adds a _merge column that shows which dataframe each row came from.

df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'], indicator=True)
df3['check'] = df3.B1_1 == df3.B1_2

>>> df3
   A1 B1_1 B1_2      _merge  check
0   1    a    a        both   True
1   2    b    d        both  False
2   3    c    c        both   True
3   4    d  NaN   left_only  False
4   5  NaN    e  right_only  False
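
Because check is False both for genuinely different values and for keys missing on one side, it can help to separate the two cases using the _merge column. This is a rough sketch, assuming the same df1 and df2; the names mismatches, only_in_df1 and only_in_df2 are illustrative only:

```python
import pandas as pd

df1 = pd.DataFrame({'A1': [1, 2, 3, 4], 'B1': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'A1': [1, 2, 3, 5], 'B1': ['a', 'd', 'c', 'e']})

df3 = df1.merge(df2, how='outer', on='A1', suffixes=['_1', '_2'], indicator=True)

# Question 1: keys present in both frames whose B1 values differ
mismatches = df3[(df3['_merge'] == 'both') & (df3['B1_1'] != df3['B1_2'])]

# Question 2: keys present in only one of the frames
only_in_df1 = df3.loc[df3['_merge'] == 'left_only', 'A1'].tolist()
only_in_df2 = df3.loc[df3['_merge'] == 'right_only', 'A1'].tolist()
```

Note that if B1 is NaN on both sides for a shared key, the != comparison still reports a mismatch, since NaN never compares equal to NaN.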
