Find difference between two data frames

I have two data frames df1 and df2, where df2 is a subset of df1. How do I get a new data frame (df3) which is the difference between the two data frames?

In other words, a data frame that has all the rows/columns in df1 that are not in df2?

By using drop_duplicates:

pd.concat([df1,df2]).drop_duplicates(keep=False)

Update:

The above method only works for dataframes that do not contain duplicate rows themselves. For example:

df1=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})
df2=pd.DataFrame({'A':[1],'B':[2]})

It produces the output below, which is wrong.

Wrong output:

pd.concat([df1, df2]).drop_duplicates(keep=False)
Out[655]: 
   A  B
1  2  3

Correct output:

Out[656]: 
   A  B
1  2  3
2  3  4
3  3  4

How to achieve that?

Method 1: Using isin with tuple

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]
Out[657]: 
   A  B
1  2  3
2  3  4
3  3  4

Method 2: merge with indicator

df1.merge(df2,indicator = True, how='left').loc[lambda x : x['_merge']!='both']
Out[421]: 
   A  B     _merge
1  2  3  left_only
2  3  4  left_only
3  3  4  left_only
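As a quick check, the two methods can be run side by side on the example above. This is only a sketch; the names by_isin and by_merge are mine, and the extra drop(columns='_merge') is added so that both results have the same columns.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

# Method 1: turn each row into a tuple and test membership row by row.
by_isin = df1[~df1.apply(tuple, axis=1).isin(df2.apply(tuple, axis=1))]

# Method 2: left merge on all shared columns, keep rows found only in df1.
by_merge = (
    df1.merge(df2, how='left', indicator=True)
       .loc[lambda x: x['_merge'] != 'both']
       .drop(columns='_merge')
)

print(by_isin)
print(by_merge)
```

Both keep the duplicated row (3, 4) twice, which is what the concat approach loses.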

For rows, try this, where Name is the shared key column (it can be a list for multiple common columns, or you can specify left_on and right_on):

m = df1.merge(df2, on='Name', how='outer', suffixes=['', '_'], indicator=True)

The indicator=True setting is useful as it adds a column called _merge, which classifies every row of the merged result into one of 3 possible kinds: "left_only", "right_only" or "both".
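A minimal sketch of how the _merge column might be used to split the result. The data is made up for illustration, and the names only_in_df1, only_in_df2 and changed are hypothetical helpers, not part of the answer.

```python
import pandas as pd

# Small made-up frames; 'Name' is the shared key column.
df1 = pd.DataFrame({'Name': ['John', 'Mike', 'Smith'], 'Age': [23, 45, 12]})
df2 = pd.DataFrame({'Name': ['John', 'Smith', 'Zoe'], 'Age': [23, 13, 30]})

m = df1.merge(df2, on='Name', how='outer', suffixes=('', '_'), indicator=True)

# Keys that exist on one side only.
only_in_df1 = m[m['_merge'] == 'left_only']
only_in_df2 = m[m['_merge'] == 'right_only']

# Keys present on both sides whose non-key value differs.
changed = m[(m['_merge'] == 'both') & (m['Age'] != m['Age_'])]

print(m)
```

With the empty left suffix, df1's columns keep their names and df2's overlapping columns get a trailing underscore (Age_ here).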

For columns, try this:

set(df1.columns).symmetric_difference(df2.columns)
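For instance, with two small made-up frames that share only column A:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [2, 1, 2], 'C': [2, 1, 2]})
df2 = pd.DataFrame({'A': [1, 1], 'B': [1, 1]})

# Column labels that appear in exactly one of the two frames.
differing = set(df1.columns).symmetric_difference(df2.columns)
print(sorted(differing))  # ['B', 'C']
```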

Method 1 of the accepted answer will not work for data frames with NaNs inside, as np.nan != np.nan. I am not sure if this is the best way, but it can be avoided by

df1[~df1.astype(str).apply(tuple, 1).isin(df2.astype(str).apply(tuple, 1))]

It's slower, because it needs to cast the data to strings, but thanks to this cast every NaN becomes the string 'nan', and 'nan' == 'nan'.

Let's go through the code. First we cast the values to strings and apply the tuple function to each row.

df1.astype(str).apply(tuple, 1)
df2.astype(str).apply(tuple, 1)

Thanks to that, we get a pd.Series object holding tuples. Each tuple contains a whole row from df1 / df2. Then we apply the isin method on the df1 series to check whether each tuple "is in" the df2 series. The result is a pd.Series of bool values: True if the tuple from df1 is in df2. In the end, we negate the results with the ~ sign and apply the filter to df1. Long story short, we get only those rows from df1 that are not in df2.

To make it more readable, we may write it as:

df1_str_tuples = df1.astype(str).apply(tuple, 1)
df2_str_tuples = df2.astype(str).apply(tuple, 1)
df1_values_in_df2_filter = df1_str_tuples.isin(df2_str_tuples)
df1_values_not_in_df2 = df1[~df1_values_in_df2_filter]
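A small sketch of the same steps on data that actually contains a NaN. The frames are made up, and both use float columns on purpose: after the string cast, 1 and 1.0 would become the different strings '1' and '1.0', so the dtypes of df1 and df2 are assumed to match.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, np.nan, 3.0], 'B': [2.0, 3.0, 4.0]})
df2 = pd.DataFrame({'A': [np.nan], 'B': [3.0]})

# Cast to strings first, then build one tuple per row.
df1_str_tuples = df1.astype(str).apply(tuple, axis=1)
df2_str_tuples = df2.astype(str).apply(tuple, axis=1)

# The row (NaN, 3.0) is recognised as present in df2 and filtered out.
result = df1[~df1_str_tuples.isin(df2_str_tuples)]
print(result)
```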

Edit 2: I figured out a new solution that does not require setting the index:

newdf=pd.concat([df1,df2]).drop_duplicates(keep=False)

Okay, I found that the highest-voted answer already contains what I figured out. Yes, this code can only be used on condition that neither of the two dfs contains duplicates.


I have a tricky method. First we set 'Name' as the index of the two dataframes given in the question. Since both dfs share the same 'Name' values, we can just drop the 'smaller' df's index from the 'bigger' df. Here is the code.

df1.set_index('Name',inplace=True)
df2.set_index('Name',inplace=True)
newdf=df1.drop(df2.index)

import pandas as pd
# given
df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
    'Age':[23,45,12,34,27,44,28,39,40]})
df2 = pd.DataFrame({'Name':['John','Smith','Wale','Tom','Menda','Yuswa',],
    'Age':[23,12,34,44,28,40]})

# find elements in df1 that are not in df2
df_1notin2 = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))].reset_index(drop=True)

# output:
print('df1\n', df1)
print('df2\n', df2)
print('df_1notin2\n', df_1notin2)

# df1
#     Age   Name
# 0   23   John
# 1   45   Mike
# 2   12  Smith
# 3   34   Wale
# 4   27  Marry
# 5   44    Tom
# 6   28  Menda
# 7   39   Bolt
# 8   40  Yuswa
# df2
#     Age   Name
# 0   23   John
# 1   12  Smith
# 2   34   Wale
# 3   44    Tom
# 4   28  Menda
# 5   40  Yuswa
# df_1notin2
#     Age   Name
# 0   45   Mike
# 1   27  Marry
# 2   39   Bolt
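One caveat with the column-by-column check above: it tests each column independently, so a row whose Name and Age both occur somewhere in df2, but not in the same row, is treated as present. A minimal sketch with made-up data, compared against a row-wise merge (the names col_wise and row_wise are mine):

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['John', 'Mike', 'John'], 'Age': [23, 45, 45]})
df2 = pd.DataFrame({'Name': ['John', 'Mike'], 'Age': [23, 45]})

# Column-wise check: ('John', 45) is missed, because 'John' and 45
# each occur somewhere in df2, just not together.
col_wise = df1[~(df1['Name'].isin(df2['Name']) & df1['Age'].isin(df2['Age']))]

# Row-wise check via merge: the whole (Name, Age) pair has to match.
row_wise = (
    df1.merge(df2, how='left', indicator=True)
       .query('_merge == "left_only"')
       .drop(columns='_merge')
)

print(col_wise)   # empty
print(row_wise)   # the ('John', 45) row
```

On the question's data the two agree, since every Name is unique there.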

Perhaps a simpler one-liner, with identical or different column names. It worked even when df2['Name2'] contained duplicate values.

# The parentheses let the method chain span several lines.
newDf = (
    df1.set_index('Name1')
       .drop(df2['Name2'], errors='ignore')
       .reset_index(drop=False)
)

In addition to the accepted answer, I would like to propose a more general solution that can find a 2D set difference of two dataframes with any index / columns (they might not coincide for both dataframes). The method also allows setting a tolerance for float elements in the dataframe comparison (it uses np.isclose).


import numpy as np
import pandas as pd

def get_dataframe_setdiff2d(df_new: pd.DataFrame, 
                            df_old: pd.DataFrame, 
                            rtol=1e-03, atol=1e-05) -> pd.DataFrame:
    """Returns set difference of two pandas DataFrames"""

    union_index = np.union1d(df_new.index, df_old.index)
    union_columns = np.union1d(df_new.columns, df_old.columns)

    new = df_new.reindex(index=union_index, columns=union_columns)
    old = df_old.reindex(index=union_index, columns=union_columns)

    mask_diff = ~np.isclose(new, old, rtol, atol)

    df_bool = pd.DataFrame(mask_diff, union_index, union_columns)

    df_diff = pd.concat([new[df_bool].stack(),
                         old[df_bool].stack()], axis=1)

    df_diff.columns = ["New", "Old"]

    return df_diff

Example:

In [1]

df1 = pd.DataFrame({'A':[2,1,2],'C':[2,1,2]})
df2 = pd.DataFrame({'A':[1,1],'B':[1,1]})

print("df1:\n", df1, "\n")

print("df2:\n", df2, "\n")

diff = get_dataframe_setdiff2d(df1, df2)

print("diff:\n", diff, "\n")
Out [1]

df1:
   A  C
0  2  2
1  1  1
2  2  2 

df2:
   A  B
0  1  1
1  1  1 

diff:
     New  Old
0 A  2.0  1.0
  B  NaN  1.0
  C  2.0  NaN
1 B  NaN  1.0
  C  1.0  NaN
2 A  2.0  NaN
  C  2.0  NaN 

As mentioned here,

df1[~df1.apply(tuple,1).isin(df2.apply(tuple,1))]

is the correct solution, but it will produce the wrong output if

df1=pd.DataFrame({'A':[1],'B':[2]})
df2=pd.DataFrame({'A':[1,2,3,3],'B':[2,3,4,4]})

In that case the above solution gives an empty DataFrame. Instead, you should use the concat method after removing duplicates from each dataframe.

Use concat with drop_duplicates:

df1=df1.drop_duplicates(keep="first") 
df2=df2.drop_duplicates(keep="first") 
pd.concat([df1,df2]).drop_duplicates(keep=False)
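Run on the two frames from this answer, the steps look as follows. This is a sketch; the names df1_unique, df2_unique and diff are mine. Note that the result is a symmetric difference and that the duplicated (3, 4) row is reported only once.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1], 'B': [2]})
df2 = pd.DataFrame({'A': [1, 2, 3, 3], 'B': [2, 3, 4, 4]})

# Remove duplicates inside each frame first, so that keep=False only
# drops rows that occur in both frames.
df1_unique = df1.drop_duplicates(keep="first")
df2_unique = df2.drop_duplicates(keep="first")

diff = pd.concat([df1_unique, df2_unique]).drop_duplicates(keep=False)
print(diff)
```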

Pandas now offers a new API for diffing data frames: pandas.DataFrame.compare

df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
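The frames behind that output are not shown in the answer. The sketch below uses data chosen to reproduce it, assuming pandas 1.1 or newer, where DataFrame.compare is available.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': ['a', 'a', 'b', 'b', 'a'],
    'col2': [1.0, 2.0, 3.0, np.nan, 5.0],
    'col3': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# df2 is a copy of df with two cells changed.
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0

# Only rows and columns containing at least one difference are kept;
# NaN in the same position on both sides counts as equal.
diff = df.compare(df2)
print(diff)
```

compare expects both frames to carry identical row and column labels, so it suits cell-level changes rather than missing rows.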

A slight variation of @lianli's nice solution that does not require changing the index of the existing dataframes:

newdf = df1.drop(df1.join(df2.set_index('Name').index))

Finding the difference by index, assuming df2 is a subset of df1 and the index labels are carried forward when subsetting:

df1.loc[list(set(df1.index).symmetric_difference(set(df2.index)))].dropna()

# Example

df1 = pd.DataFrame({"gender":np.random.choice(['m','f'],size=5), "subject":np.random.choice(["bio","phy","chem"],size=5)}, index = [1,2,3,4,5])

df2 =  df1.loc[[1,3,5]]

df1

 gender subject
1      f     bio
2      m    chem
3      f     phy
4      m     bio
5      f     bio

df2

  gender subject
1      f     bio
3      f     phy
5      f     bio

df3 = df1.loc[list(set(df1.index).symmetric_difference(set(df2.index)))].dropna()

df3

  gender subject
2      m    chem
4      m     bio
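A variation that stays within the pandas Index API and avoids passing a set to .loc (which, as far as I know, pandas 2.x no longer accepts). This is a sketch with fixed data instead of random choices; the name missing is mine.

```python
import pandas as pd

df1 = pd.DataFrame(
    {"gender": ["f", "m", "f", "m", "f"],
     "subject": ["bio", "chem", "phy", "bio", "bio"]},
    index=[1, 2, 3, 4, 5],
)
df2 = df1.loc[[1, 3, 5]]

# Index labels present in df1 but not in df2.
missing = df1.index.difference(df2.index)
df3 = df1.loc[missing]
print(df3)
```

Index.difference is one-sided (df1 minus df2). If df2 might contain labels that df1 lacks, Index.symmetric_difference gives the two-sided version, but those labels cannot be looked up in df1.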

I had issues with handling duplicates when there were duplicates on one side and at least one on the other side, so I used collections.Counter to do a better diff, ensuring both sides have the same count. This doesn't return duplicates, and it won't return anything for a row if both sides have the same count.

from collections import Counter

def diff(df1, df2, on=None):
    """
    :param on: same as pandas.df.merge(on) (a list of columns)
    """
    on = on if on else df1.columns
    df1on = df1[on]
    df2on = df2[on]
    c1 = Counter(df1on.apply(tuple, 'columns'))
    c2 = Counter(df2on.apply(tuple, 'columns'))
    c1c2 = c1-c2
    c2c1 = c2-c1
    df1ondf2on = pd.DataFrame(list(c1c2.elements()), columns=on)
    df2ondf1on = pd.DataFrame(list(c2c1.elements()), columns=on)
    df1df2 = df1.merge(df1ondf2on).drop_duplicates(subset=on)
    df2df1 = df2.merge(df2ondf1on).drop_duplicates(subset=on)
    return pd.concat([df1df2, df2df1])

> df1 = pd.DataFrame({'a': [1, 1, 3, 4, 4]})
> df2 = pd.DataFrame({'a': [1, 2, 3, 4, 4]})
> diff(df1, df2)
   a
0  1
0  2

I found the deepdiff library to be a wonderful tool that also extends well to dataframes if a different level of detail is required or ordering matters. You can experiment with diffing to_dict('records'), to_numpy(), and other exports:

import pandas as pd
from deepdiff import DeepDiff

df1 = pd.DataFrame({
    'Name':
        ['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
    'Age':
        [23,45,12,34,27,44,28,39,40]
})

df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]

DeepDiff(df1.to_dict(), df2.to_dict())
# {'dictionary_item_removed': [root['Name'][1], root['Name'][4], root['Name'][7], root['Age'][1], root['Age'][4], root['Age'][7]]}

There is a new method in pandas, DataFrame.compare, that compares two different dataframes and returns which values changed in each column for the data records.

Example

First Dataframe

Id Customer Status      Date
1      ABC   Good  Mar 2023
2      BAC   Good  Feb 2024
3      CBA    Bad  Apr 2022

Second Dataframe

Id Customer Status      Date
1      ABC    Bad  Mar 2023
2      BAC   Good  Feb 2024
5      CBA   Good  Apr 2024
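The answer does not show how the two frames are built. A sketch that recreates them, assuming Id is an ordinary column and both frames use the default 0..2 index (which is consistent with the row labels in the results further down):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Id': [1, 2, 3],
    'Customer': ['ABC', 'BAC', 'CBA'],
    'Status': ['Good', 'Good', 'Bad'],
    'Date': ['Mar 2023', 'Feb 2024', 'Apr 2022'],
})

df2 = pd.DataFrame({
    'Id': [1, 2, 5],
    'Customer': ['ABC', 'BAC', 'CBA'],
    'Status': ['Bad', 'Good', 'Good'],
    'Date': ['Mar 2023', 'Feb 2024', 'Apr 2024'],
})
```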

Comparing Dataframes

print("Dataframe difference -- \n")
print(df1.compare(df2))

print("Dataframe difference keeping equal values -- \n")
print(df1.compare(df2, keep_equal=True))

print("Dataframe difference keeping same shape -- \n")
print(df1.compare(df2, keep_shape=True))

print("Dataframe difference keeping same shape and equal values -- \n")
print(df1.compare(df2, keep_shape=True, keep_equal=True))

Result

Dataframe difference -- 

    Id       Status            Date          
  self other   self other      self     other
0  NaN   NaN   Good   Bad       NaN       NaN
2  3.0   5.0    Bad  Good  Apr 2022  Apr 2024

Dataframe difference keeping equal values -- 

    Id       Status            Date          
  self other   self other      self     other
0    1     1   Good   Bad  Mar 2023  Mar 2023
2    3     5    Bad  Good  Apr 2022  Apr 2024

Dataframe difference keeping same shape -- 

    Id       Customer       Status            Date          
  self other     self other   self other      self     other
0  NaN   NaN      NaN   NaN   Good   Bad       NaN       NaN
1  NaN   NaN      NaN   NaN    NaN   NaN       NaN       NaN
2  3.0   5.0      NaN   NaN    Bad  Good  Apr 2022  Apr 2024

Dataframe difference keeping same shape and equal values -- 

    Id       Customer       Status            Date          
  self other     self other   self other      self     other
0    1     1      ABC   ABC   Good   Bad  Mar 2023  Mar 2023
1    2     2      BAC   BAC   Good  Good  Feb 2024  Feb 2024
2    3     5      CBA   CBA    Bad  Good  Apr 2022  Apr 2024

Using a lambda function, you can filter the rows with the _merge value "left_only" to get all the rows in df1 which are missing from df2:

df3 = df1.merge(df2, how = 'outer' ,indicator=True).loc[lambda x :x['_merge']=='left_only']
df3

Defining our dataframes:

df1 = pd.DataFrame({
    'Name':
        ['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa'],
    'Age':
        [23,45,12,34,27,44,28,39,40]
})

df2 = df1[df1.Name.isin(['John','Smith','Wale','Tom','Menda','Yuswa'])]

df1

    Name  Age
0   John   23
1   Mike   45
2  Smith   12
3   Wale   34
4  Marry   27
5    Tom   44
6  Menda   28
7   Bolt   39
8  Yuswa   40

df2

    Name  Age
0   John   23
2  Smith   12
3   Wale   34
5    Tom   44
6  Menda   28
8  Yuswa   40

The difference between the two would be:

df1[~df1.isin(df2)].dropna()

    Name   Age
1   Mike  45.0
4  Marry  27.0
7   Bolt  39.0

Where:

  • df1.isin(df2) returns a boolean frame marking the cells of df1 that also appear in df2 at the same index and column labels.
  • ~ (element-wise logical NOT) in front of the expression negates the results, so we get the elements in df1 that are NOT in df2, i.e. the difference between the two.
  • .dropna() drops the rows with NaN, leaving the desired output.

Note: This only works if len(df1) >= len(df2). If df2 is longer than df1 you can reverse the expression: df2[~df2.isin(df1)].dropna()
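Because DataFrame.isin with a DataFrame argument compares cells at matching index and column labels, the approach relies on df2 having kept df1's index labels. A sketch showing what changes once df2's index is reset (the names aligned and misaligned are mine):

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['John', 'Mike', 'Smith', 'Wale', 'Marry', 'Tom', 'Menda', 'Bolt', 'Yuswa'],
    'Age': [23, 45, 12, 34, 27, 44, 28, 39, 40],
})
df2 = df1[df1.Name.isin(['John', 'Smith', 'Wale', 'Tom', 'Menda', 'Yuswa'])]

# df2 kept its original index labels (0, 2, 3, 5, 6, 8), so cells line up.
aligned = df1[~df1.isin(df2)].dropna()

# After a reset, df2's labels are 0..5 and most cells no longer line up.
misaligned = df1[~df1.isin(df2.reset_index(drop=True))].dropna()

print(aligned)      # Mike, Marry, Bolt
print(misaligned)   # every row except John
```

If the index labels cannot be trusted, a row-wise comparison such as a merge with indicator=True is the safer choice.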

Symmetric Difference

If you are interested in the rows that are only in one of the dataframes but not both, you are looking for the symmetric difference:

pd.concat([df1,df2]).drop_duplicates(keep=False)

⚠️ Only works if neither dataframe contains duplicates.

Set Difference / Relational Algebra Difference

If you are interested in the relational algebra difference / set difference, i.e. df1-df2 or df1\df2:

pd.concat([df1,df2,df2]).drop_duplicates(keep=False) 

⚠️ Only works if neither dataframe contains duplicates.
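A small sketch contrasting the two expressions on made-up frames where df2 is not a subset of df1 (the names sym_diff and set_diff are mine):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]})
df2 = pd.DataFrame({'A': [3, 4]})

# Symmetric difference: rows that occur in exactly one frame.
sym_diff = pd.concat([df1, df2]).drop_duplicates(keep=False)

# Set difference df1 - df2: appending df2 twice guarantees that every row of
# df2 is duplicated, so keep=False removes it even if df1 does not contain it.
set_diff = pd.concat([df1, df2, df2]).drop_duplicates(keep=False)

print(sym_diff['A'].tolist())  # [1, 2, 4]
print(set_diff['A'].tolist())  # [1, 2]
```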

Try this one:

df_new = df1.merge(df2, how='outer', indicator=True).query('_merge == "left_only"').drop(columns='_merge')

It will result in a new dataframe with the differences: the values that exist in df1 but not in df2.

Another possible solution is to use numpy broadcasting:

df1[np.all(~np.all(df1.values == df2.values[:, None], axis=2), axis=0)]

Output:

    Name  Age
1   Mike   45
4  Marry   27
7   Bolt   39
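The one-liner can be unpacked into steps to make the shapes visible. This is a sketch on small numeric frames of my own; the intermediate names cell_equal, row_equal and keep are hypothetical.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
df2 = pd.DataFrame({'A': [1], 'B': [2]})

# df2.values[:, None] has shape (len(df2), 1, n_cols); comparing it with
# df1.values of shape (len(df1), n_cols) broadcasts to
# (len(df2), len(df1), n_cols).
cell_equal = df1.values == df2.values[:, None]

# A df1 row matches a df2 row only if every column is equal.
row_equal = np.all(cell_equal, axis=2)   # shape (len(df2), len(df1))

# Keep the df1 rows that match no row of df2.
keep = np.all(~row_equal, axis=0)        # shape (len(df1),)
result = df1[keep]
print(result)
```

The intermediate array holds len(df1) * len(df2) * n_cols booleans, so this suits small frames better than large ones.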
