
Python - Finding Row Discrepancies Between Two Dataframes

I have two dataframes with the same number of columns, d1 and d2.

NOTE: d1 and d2 may have different numbers of rows.
NOTE: Matching rows may not sit at the same index position in each dataframe.

What is the best way to check whether or not the two dataframes have the same data?

My current solution consists of appending the two dataframes together and dropping any rows that match.

import pandas as pd

d_combined = pd.concat([d1, d2])  # DataFrame.append was removed in pandas 2.0
d_discrepancy = d_combined.drop_duplicates(keep=False)
print(d_discrepancy)

I am new to python and the pandas library. Because I will be using dataframes with millions of rows and 8-10 columns, is there a faster and more efficient way to check for discrepancies? Can it also be shown which initial dataframe the resulting discrepancy row is from?

Setup

import pandas as pd

d1 = pd.DataFrame(dict(A=[1, 2, 3, 4]))
d2 = pd.DataFrame(dict(A=[2, 3, 4, 5]))

Option 1
Use pd.merge. I'll include the parameter indicator=True to show where the data came from.

d1.merge(d2, how='outer', indicator=True)

   A      _merge
0  1   left_only
1  2        both
2  3        both
3  4        both
4  5  right_only

If they have the same data, I'd expect that the _merge column would be both for everything. So we can check with

d1.merge(d2, how='outer', indicator=True)._merge.eq('both').all()

False

In this case it returned False, so the two dataframes do not hold the same data.
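
The question also asks which dataframe each discrepancy row came from. A minimal sketch of one way to get that from the merge result, assuming both frames share the same column names; the column name source and the labels 'd1' / 'd2' are made up for illustration:

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=[1, 2, 3, 4]))
d2 = pd.DataFrame(dict(A=[2, 3, 4, 5]))

# Outer-merge on all shared columns; _merge records where each row was found.
merged = d1.merge(d2, how='outer', indicator=True)

# Keep only the rows that are missing from one of the two frames.
discrepancies = merged[merged['_merge'] != 'both']

# Translate the indicator values into the hypothetical labels 'd1' / 'd2'.
labels = {'left_only': 'd1', 'right_only': 'd2'}
discrepancies = discrepancies.assign(
    source=discrepancies['_merge'].astype(str).map(labels)
).drop(columns='_merge')

print(discrepancies)
```

With the setup data this leaves the rows A=1 (from d1) and A=5 (from d2).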


Option 2
Use drop_duplicates
You need to make sure you drop the duplicates from the initial dataframes first.

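# True means no discrepancies. pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
pd.concat([d1.drop_duplicates(), d2.drop_duplicates()]) \
    .drop_duplicates(keep=False).empty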
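
The initial de-duplication matters because keep=False removes every copy of a repeated row: a row that appears twice in d1 and never in d2 would otherwise vanish from the result. To also see which frame each leftover row came from, one possible sketch is to tag the frames before combining them. It assumes both frames have identical column names and no existing column called source (a name chosen here for illustration):

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=[1, 2, 3, 4]))
d2 = pd.DataFrame(dict(A=[2, 3, 4, 5]))

# Tag each de-duplicated frame with a hypothetical 'source' column.
tagged = pd.concat(
    [
        d1.drop_duplicates().assign(source='d1'),
        d2.drop_duplicates().assign(source='d2'),
    ],
    ignore_index=True,
)

# Compare on the original data columns only, so the tag does not
# make otherwise identical rows look different.
data_cols = list(d1.columns)
diff = tagged.drop_duplicates(subset=data_cols, keep=False)

print(diff)
```

Here diff holds A=1 tagged 'd1' and A=5 tagged 'd2'; diff.empty gives the same yes/no answer as the one-liner above.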
