Difference between two DataFrames in pandas

Question

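I have two DataFrames that share the same basic schema (4 date fields, a couple of string fields, and 4-5 float fields). Call them df1 and df2.

What I want is basically a "diff" of the two: all rows that are not shared between the two DataFrames (that is, not in the set intersection). Note that the two DataFrames need not be the same length.

I tried pandas.merge(how='outer'), but I'm not sure which column to pass as the 'key', since there really isn't one, and the various combinations I tried did not work. It is possible that df1 or df2 has two (or more) identical rows.

What is a good way to do this in pandas / Python?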

Answer 1

IIUC:
You can use pd.Index.symmetric_difference

pd.concat([df1, df2]).loc[
    df1.index.symmetric_difference(df2.index)
]
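Note that this compares index labels, not row contents. A minimal sketch with made-up data (the frames, the 'fruit' column, and the index values are illustrative assumptions, not taken from the question):

```python
import pandas as pd

# Hypothetical sample data: the two frames share index labels 1 and 2.
df1 = pd.DataFrame({"fruit": ["apple", "banana", "coconut"]}, index=[0, 1, 2])
df2 = pd.DataFrame({"fruit": ["banana", "coconut", "durian"]}, index=[1, 2, 3])

# Index labels that occur in exactly one of the two frames.
only_one_side = df1.index.symmetric_difference(df2.index)

# Stack both frames, then keep the rows whose label is on one side only.
diff = pd.concat([df1, df2]).loc[only_one_side]
print(diff)
```

If both frames carry the default integer index, the labels overlap even when the rows differ, so this only fits data whose index actually identifies the row.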

Answer 2

Try this:

diff_df = pd.merge(df1, df2, how='outer', indicator='Exist')

diff_df = diff_df.loc[diff_df['Exist'] != 'both']

You will get a DataFrame with all the rows that do not exist in both df1 and df2.
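To see what the indicator column contains, here is a small sketch on made-up data (the 'fruit' and 'qty' columns and their values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical sample data with identical column names in both frames.
df1 = pd.DataFrame({"fruit": ["apple", "banana", "coconut"], "qty": [1, 2, 3]})
df2 = pd.DataFrame({"fruit": ["apple", "banana", "durian"], "qty": [1, 3, 4]})

# With no 'on' argument, merge joins on all columns the frames share,
# so a row counts as common only if every column matches.
merged = pd.merge(df1, df2, how="outer", indicator="Exist")

# 'Exist' holds 'left_only', 'right_only', or 'both' for each row.
diff_df = merged.loc[merged["Exist"] != "both"]
print(diff_df)
```

The 'Exist' column also tells you which side each leftover row came from, which is handy if you need the two one-sided differences separately.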

Answer 3

Set df2.columns = df1.columns
Now set every column as the index: df1 = df1.set_index(df1.columns.tolist()), and similarly for df2.
You can now do df1.index.difference(df2.index) and df2.index.difference(df1.index); the two results are your distinct rows.
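Put together, the steps might look like the sketch below. The sample frames are made up for illustration, and converting the result back with to_frame is my addition, not part of the steps above:

```python
import pandas as pd

# Hypothetical sample data with identical column names in both frames.
df1 = pd.DataFrame({"fruit": ["apple", "banana", "coconut"], "qty": [1, 2, 3]})
df2 = pd.DataFrame({"fruit": ["apple", "banana", "durian"], "qty": [1, 3, 4]})

# Turn every column into an index level, so each row becomes one index entry.
idx1 = df1.set_index(df1.columns.tolist()).index
idx2 = df2.set_index(df2.columns.tolist()).index

# One-sided differences between the two sets of rows.
only_in_df1 = idx1.difference(idx2)
only_in_df2 = idx2.difference(idx1)

# Convert the MultiIndex results back to ordinary DataFrames.
print(only_in_df1.to_frame(index=False))
print(only_in_df2.to_frame(index=False))
```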

Answer 4

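You can use this function. The output is an ordered dict of 6 DataFrames, which you can write to Excel for further analysis.

'df1' and 'df2' refer to your input DataFrames.
'uid' refers to the column or combination of columns that make up the unique key (e.g. 'Fruits').
'dedupe' (default=True) drops duplicates in df1 and df2 (see Step 4 in the comments).
'labels' (default=('df1', 'df2')) lets you name the input DataFrames. If a unique key exists in both DataFrames but has different values in one or more columns, it is usually important to know these rows, put one on top of the other, and label each row with the name so we know which DataFrame it belongs to.
'drop' can take a list of columns to exclude from consideration when looking at the differences.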

To start:

df1 = pd.DataFrame([['apple', '1'], ['banana', 2], ['coconut',3]], columns=['Fruits','Quantity'])
df2 = pd.DataFrame([['apple', '1'], ['banana', 3], ['durian',4]], columns=['Fruits','Quantity'])
dict1 = diff_func(df1, df2, 'Fruits')

In [10]: dict1['df1_only']:
Out[10]:
    Fruits Quantity
1  coconut        3

In [11]: dict1['df2_only']:
Out[11]:
   Fruits Quantity
3  durian        4

In [12]: dict1['Diff']:
Out[12]:
   Fruits Quantity df1 or df2
0  banana        2        df1
1  banana        3        df2

In [13]: dict1['Merge']:
Out[13]:
  Fruits Quantity
0  apple        1

Here is the code:

import pandas as pd
from collections import OrderedDict as od

def diff_func(df1, df2, uid, dedupe=True, labels=('df1', 'df2'), drop=[]):
    dict_df = {labels[0]: df1, labels[1]: df2}
    col1 = df1.columns.values.tolist()
    col2 = df2.columns.values.tolist()

    # There could be columns known to be different, hence allow user to pass this as a list to be dropped.
    if drop:
        print ('Ignoring columns {} in comparison.'.format(', '.join(drop)))
        col1 = list(filter(lambda x: x not in drop, col1))
        col2 = list(filter(lambda x: x not in drop, col2))
        df1 = df1[col1]
        df2 = df2[col2]


    # Step 1 - Check if no. of columns are the same:
    len_lr = len(col1), len(col2)
    assert len_lr[0]==len_lr[1], \
    'Cannot compare frames with different number of columns: {}.'.format(len_lr)

    # Step 2a - Check if the set of column headers are the same
    #           (order doesn't matter)
    assert set(col1)==set(col2), \
    'Left column headers are different from right column headers.' \
       +'\n   Left orphans: {}'.format(list(set(col1)-set(col2))) \
       +'\n   Right orphans: {}'.format(list(set(col2)-set(col1)))

    # Step 2b - Check if the column headers are in the same order
    if col1 != col2:
        print ('[Note] Reordering right Dataframe...')
        df2 = df2[col1]

    # Step 3 - Check datatype are the same [Order is important]
    if set((df1.dtypes == df2.dtypes).tolist()) - {True}:
        print ('dtypes are not the same.')
        df_dtypes = pd.DataFrame({labels[0]:df1.dtypes,labels[1]:df2.dtypes,'Diff':(df1.dtypes == df2.dtypes)})
        df_dtypes = df_dtypes[df_dtypes['Diff']==False][[labels[0],labels[1],'Diff']]
        print (df_dtypes)
    else:
        print ('DataType check: Passed')

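    # Step 4 - Check for duplicate rows
    # Rebuild the lookup from the frames actually being compared (after any
    # 'drop' columns were removed), and write the deduped frames back so the
    # merge below really uses them.
    dict_df = {labels[0]: df1, labels[1]: df2}
    if dedupe:
        for key, df in dict_df.items():
            if df.shape[0] != df.drop_duplicates().shape[0]:
                print(key + ': Duplicates exist, they will be dropped.')
                dict_df[key] = df.drop_duplicates()
        df1, df2 = dict_df[labels[0]], dict_df[labels[1]]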

    # Step 5 - Check for duplicate uids.
    if type(uid)==str or type(uid)==list:
        print ('Uniqueness check: {}'.format(uid))
        for key, df in dict_df.items():
            count_uid = df.shape[0]
            count_uid_unique = df[uid].drop_duplicates().shape[0]
            var = [0,1][count_uid_unique == df.shape[0]] #<-- Round off to the nearest integer if it is 100%
            pct = round(100*count_uid_unique/df.shape[0], var)
            print ('{}: {} out of {} are unique ({}%).'.format(key, count_uid_unique, count_uid, pct))

    # Checks complete, begin merge. '''Remember to dedupe, provide labels for common_no_match'''
    dict_result = od()
    df_merge = pd.merge(df1, df2, on=col1, how='inner')
    if not df_merge.shape[0]:
        print ('Error: Merged DataFrame is empty.')
    else:
        dict_result[labels[0]] = df1
        dict_result[labels[1]] = df2
        dict_result['Merge'] = df_merge
        if type(uid)==str:
            uid = [uid]

        if type(uid)==list:
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job.
            df1_only = pd.concat([df1, df_merge]).reset_index(drop=True)
            df1_only['Duplicated']=df1_only.duplicated(subset=uid, keep=False)  #keep=False, marks all duplicates as True
            df1_only = df1_only[df1_only['Duplicated']==False]
            df2_only = pd.concat([df2, df_merge]).reset_index(drop=True)
            df2_only['Duplicated']=df2_only.duplicated(subset=uid, keep=False)
            df2_only = df2_only[df2_only['Duplicated']==False]

            label = labels[0]+' or '+labels[1]
            df_lc = df1_only.copy()
            df_lc[label] = labels[0]
            df_rc = df2_only.copy()
            df_rc[label] = labels[1]
            df_c = pd.concat([df_lc, df_rc]).reset_index(drop=True)
            df_c['Duplicated'] = df_c.duplicated(subset=uid, keep=False)
            df_c1 = df_c[df_c['Duplicated']==True]
            df_c1 = df_c1.drop('Duplicated', axis=1)
            df_uc = df_c[df_c['Duplicated']==False]

            df_uc_left = df_uc[df_uc[label]==labels[0]]
            df_uc_right = df_uc[df_uc[label]==labels[1]]

            dict_result[labels[0]+'_only'] = df_uc_left.drop(['Duplicated', label], axis=1)
            dict_result[labels[1]+'_only'] = df_uc_right.drop(['Duplicated', label], axis=1)
            dict_result['Diff'] = df_c1.sort_values(uid).reset_index(drop=True)

    return dict_result

Answer 5

With

left_df.merge(df,left_on=left_df.columns.tolist(),right_on=df.columns.tolist(),how='outer')

you can get the outer join result.
Similarly, you can get the inner join result. Then take whichever diff you want.
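One possible way to finish that last step is sketched below. It assumes both frames have the same column names and that neither frame contains repeated rows; the sample data and the concat/drop_duplicates step are my assumptions, not something spelled out above:

```python
import pandas as pd

# Hypothetical sample data; 'left_df' and 'df' mirror the names used above.
left_df = pd.DataFrame({"fruit": ["apple", "banana", "coconut"], "qty": [1, 2, 3]})
df = pd.DataFrame({"fruit": ["apple", "banana", "durian"], "qty": [1, 3, 4]})

cols = left_df.columns.tolist()

# Join on every column: outer keeps all rows, inner keeps only shared rows.
outer = left_df.merge(df, on=cols, how="outer")
inner = left_df.merge(df, on=cols, how="inner")

# Shared rows now appear twice (once from each join); dropping every copy
# of a repeated row leaves the rows found on one side only.
diff = pd.concat([outer, inner]).drop_duplicates(keep=False)
print(diff)
```

If a frame can contain identical rows, as the question mentions, this sketch would drop them as well, so the indicator-based merge from Answer 2 is probably the safer choice there.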

Answer scores and timestamps

Answer 1: score 2, 2017-11-06 08:42:07
Answer 2: score 2, 2018-08-22 14:11:04
Answer 3: score 1, 2017-11-06 07:00:00
Answer 4: score 0, 2018-08-04 08:35:58
Answer 5: score -2, 2017-11-06 09:24:13
