I have two dataframes with the same number of columns, d1 and d2.
NOTE: d1 and d2 may have different numbers of rows, and matching rows may not sit at the same index in each dataframe.
What is the best way to check whether or not the two dataframes have the same data?
My current solution consists of appending the two dataframes together and dropping any rows that match.
d_combined = pd.concat([d1, d2])  # DataFrame.append was removed in pandas 2.0
d_discrepancy = d_combined.drop_duplicates(keep=False)
print(d_discrepancy)
I am new to Python and the pandas library. Because I will be using dataframes with millions of rows and 8-10 columns, is there a faster and more efficient way to check for discrepancies? Can it also be shown which of the original dataframes each discrepancy row comes from?
Setup
import pandas as pd
d1 = pd.DataFrame(dict(A=[1, 2, 3, 4]))
d2 = pd.DataFrame(dict(A=[2, 3, 4, 5]))
Option 1
Use pd.merge. I'll include the parameter indicator=True to show where the data came from.
d1.merge(d2, how='outer', indicator=True)
   A      _merge
0  1   left_only
1  2        both
2  3        both
3  4        both
4  5  right_only
If they have the same data, I'd expect that the _merge column would be both for everything. So we can check with
d1.merge(d2, how='outer', indicator=True)._merge.eq('both').all()
False
In this case it returned False, so the two dataframes do not contain the same data.
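To also see which dataframe each discrepancy row comes from, the merged result can be filtered on the _merge column. The following is a minimal sketch using the same setup as above; the source column name and the 'd1'/'d2' labels are assumptions added for illustration, not part of the pandas API.

```python
import pandas as pd

d1 = pd.DataFrame(dict(A=[1, 2, 3, 4]))
d2 = pd.DataFrame(dict(A=[2, 3, 4, 5]))

merged = d1.merge(d2, how='outer', indicator=True)

# Keep only the rows that are missing from one of the two dataframes.
discrepancy = merged[merged['_merge'] != 'both']

# Translate the indicator into a friendlier label.
# 'source' is a hypothetical column name chosen for this example.
labels = {'left_only': 'd1', 'right_only': 'd2'}
discrepancy = discrepancy.assign(
    source=discrepancy['_merge'].astype(str).map(labels)
)

print(discrepancy)
```

Rows labelled d1 exist only in the first dataframe, and rows labelled d2 exist only in the second.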
Option 2
Use drop_duplicates
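Drop the duplicates within each dataframe first; otherwise rows repeated inside a single dataframe would cancel each other out. The result is True when both dataframes hold the same rows.
pd.concat([d1.drop_duplicates(), d2.drop_duplicates()]) \
  .drop_duplicates(keep=False).empty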
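The sketch below illustrates why the initial de-duplication matters. The data is hypothetical and chosen so that one dataframe repeats a row the other lacks. It uses pd.concat rather than DataFrame.append, since append was removed in pandas 2.0.

```python
import pandas as pd

# Hypothetical data: d1 holds the row A=1 twice, d2 does not hold it at all.
d1 = pd.DataFrame(dict(A=[1, 1, 2]))
d2 = pd.DataFrame(dict(A=[2]))

# Without de-duplicating first, the two copies of A=1 cancel each other out,
# so the check wrongly reports that nothing differs.
naive = pd.concat([d1, d2]).drop_duplicates(keep=False)

# De-duplicating each dataframe first leaves A=1 as a real discrepancy.
careful = pd.concat(
    [d1.drop_duplicates(), d2.drop_duplicates()]
).drop_duplicates(keep=False)

print(naive.empty)
print(careful.empty)
print(careful)
```

Here the naive version comes out empty even though the dataframes differ, while the de-duplicated version keeps the row that only d1 contains.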