Python - Finding Row Discrepancies Between Two Dataframes

Question

I have two dataframes with the same number of columns, d1 and d2.

NOTE: d1 and d2 may have different numbers of rows. NOTE: The same row may sit at a different index in each dataframe.

What is the best way to check whether or not the two dataframes have the same data?

My current solution consists of appending the two dataframes together and dropping any rows that match.

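import pandas as pd

# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
d_combined = pd.concat([d1, d2])
d_discrepancy = d_combined.drop_duplicates(keep=False)
print(d_discrepancy)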

I am new to Python and the pandas library. Because I will be using dataframes with millions of rows and 8-10 columns, is there a faster and more efficient way to check for discrepancies? Can it also be shown which initial dataframe each discrepancy row came from?

Answer 1

Setup

import pandas as pd

d1 = pd.DataFrame(dict(A=[1, 2, 3, 4]))
d2 = pd.DataFrame(dict(A=[2, 3, 4, 5]))

Option 1
Use pd.merge. I'll include the parameter indicator=True to show which dataframe each row came from.

d1.merge(d2, how='outer', indicator=True)

   A      _merge
0  1   left_only
1  2        both
2  3        both
3  4        both
4  5  right_only

If they have the same data, I'd expect that the _merge column would be both for everything. So we can check with

d1.merge(d2, how='outer', indicator=True)._merge.eq('both').all()

False

In this case it returned False, so the two dataframes do not contain the same data.
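To list the discrepancies themselves rather than just test for them, one possible follow-up is to filter the merged frame on the _merge column. This is a minimal sketch using the same setup; the names merged, only_in_d1 and only_in_d2 are illustrative and not part of the original answer.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=[1, 2, 3, 4]))
d2 = pd.DataFrame(dict(A=[2, 3, 4, 5]))

# Outer merge on all shared columns; _merge records where each row was found
merged = d1.merge(d2, how='outer', indicator=True)

# Rows present in only one of the two frames
discrepancies = merged[merged['_merge'] != 'both']

# Split the discrepancies by origin and drop the helper column
only_in_d1 = merged.loc[merged['_merge'] == 'left_only', d1.columns]
only_in_d2 = merged.loc[merged['_merge'] == 'right_only', d2.columns]

print(discrepancies)
```

Here left_only corresponds to d1 and right_only to d2, because d1 is the frame that merge was called on.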

Option 2
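Use drop_duplicates
You need to make sure you drop the duplicates from the initial dataframes first. Otherwise a row that is repeated within d1 but missing from d2 would be discarded by keep=False and the difference would go unnoticed. The expression below is True only when both dataframes contain the same set of rows.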

# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
pd.concat([d1.drop_duplicates(), d2.drop_duplicates()]) \
    .drop_duplicates(keep=False).empty
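The drop_duplicates approach can also report where each leftover row came from. The sketch below assumes pd.concat with its keys argument, which adds an outer index level naming the source frame; the labels 'd1' and 'd2' and the variable names are illustrative choices, not part of the original answer.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=[1, 2, 3, 4]))
d2 = pd.DataFrame(dict(A=[2, 3, 4, 5]))

# keys= tags every row with the frame it came from (outer index level)
combined = pd.concat(
    [d1.drop_duplicates(), d2.drop_duplicates()],
    keys=['d1', 'd2'],
)

# drop_duplicates compares column values only, so the source tag in the
# index does not stop matching rows from being removed
diff = combined.drop_duplicates(keep=False)

# Source label for each remaining (unmatched) row
sources = diff.index.get_level_values(0)

print(diff)
```

Because the tag lives in the index rather than in a column, it does not affect the duplicate comparison, and each surviving row still carries its origin.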

1 answers

solution1
3 2017-08-14 22:29:42
