How to find the set difference between two Pandas DataFrames

Question

I'd like to find the difference between the columns of two DataFrames. I tried the command:

np.setdiff1d(train.columns, train_1.columns)

which results in an empty array:

array([], dtype=object)

However, the two DataFrames have different numbers of columns:

len(train.columns), len(train_1.columns)  # (51, 56)

which means that the two DataFrames are obviously different.

What is wrong here?

Answer 1

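The result is correct; setdiff1d is simply order-dependent. It returns only the elements of the first input array that do not occur in the second. An empty result therefore means that every column of train also exists in train_1; the extra columns are in train_1.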
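To illustrate the order dependence, here is a small sketch with made-up column names (cols_small and cols_large are hypothetical stand-ins for train.columns and train_1.columns):

```python
import numpy as np

# Hypothetical column names; cols_small is a strict subset of cols_large.
cols_small = ['a', 'b']
cols_large = ['a', 'b', 'c']

# Elements of the first array that are missing from the second: none.
forward = np.setdiff1d(cols_small, cols_large)

# Swapping the arguments reveals the extra column.
backward = np.setdiff1d(cols_large, cols_small)

print(forward)   # empty array
print(backward)  # ['c']
```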

If you do not care which of the DataFrames has the unique columns, you can use setxor1d. It returns "the unique values that are in only one (not both) of the input arrays"; see the NumPy documentation.

import numpy

colsA = ['a', 'b', 'c', 'd']
colsB = ['b','c']

c = numpy.setxor1d(colsA, colsB)

This returns an array containing 'a' and 'd'.

If you want to use setdiff1d you need to check for differences both ways:

import numpy as np

# columns in train.columns that are not in train_1.columns
c1 = np.setdiff1d(train.columns, train_1.columns)

# columns in train_1.columns that are not in train.columns
c2 = np.setdiff1d(train_1.columns, train.columns)
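As a minimal sketch of the two-way check, the helper below wraps both calls. The function name column_diff and the two small frames are hypothetical, made up for illustration; they are not the asker's data.

```python
import numpy as np
import pandas as pd

def column_diff(df_a, df_b):
    """Return (columns only in df_a, columns only in df_b)."""
    only_a = np.setdiff1d(df_a.columns, df_b.columns)
    only_b = np.setdiff1d(df_b.columns, df_a.columns)
    return only_a, only_b

# Hypothetical frames standing in for train and train_1.
train = pd.DataFrame({'a': [1], 'b': [2]})
train_1 = pd.DataFrame({'a': [1], 'b': [2], 'c': [3]})

only_in_train, only_in_train_1 = column_diff(train, train_1)
print(only_in_train)    # empty array
print(only_in_train_1)  # ['c']
```

Returning both directions separately keeps the information about which DataFrame owns each extra column, which setxor1d discards.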

Answer 2

Use something like this:

data_3 = data1[~data1.isin(data2)]

Here data1 and data2 are columns (Series), and data_3 holds the values of data1 that do not appear in data2, i.e. the set difference data1 - data2.
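A hedged sketch of this approach on two small made-up Series follows; the second half applies the same pattern to column labels through Index.isin. All names and values are illustrative only.

```python
import pandas as pd

# Hypothetical data.
data1 = pd.Series(['a', 'b', 'c', 'd'])
data2 = pd.Series(['b', 'c'])

# Keep the values of data1 that do not appear anywhere in data2.
data_3 = data1[~data1.isin(data2)]
print(data_3.tolist())  # ['a', 'd']

# Same idea for column labels of two hypothetical frames.
train = pd.DataFrame(columns=['a', 'b'])
train_1 = pd.DataFrame(columns=['a', 'b', 'c'])
extra = train_1.columns[~train_1.columns.isin(train.columns)]
print(list(extra))  # ['c']
```

Unlike setdiff1d, which returns sorted unique values, the isin mask keeps the original order and any duplicates.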

2 answers

solution1
1 ACCEPTED 2017-10-06 05:40:18

solution2
1 2018-12-04 07:09:55
