How to find the set difference between two Pandas DataFrames
I'd like to check the difference between the columns of two DataFrames. I tried using the command:
np.setdiff1d(train.columns, train_1.columns)
which results in an empty array:
array([], dtype=object)
However, the number of columns in the two DataFrames is different:
len(train.columns), len(train_1.columns)  # (51, 56)
which means that the two DataFrames are obviously different.
What is wrong here?
The results are correct; however, setdiff1d depends on the order of its arguments. It only returns the elements of the first input array that do not occur in the second array. Here every column of train also appears in train_1, so the result is empty even though train_1 has more columns.
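To make the asymmetry concrete, here is a minimal sketch. The column names are hypothetical and stand in for the real train and train_1 columns, which the question does not show.

```python
import numpy as np

# Hypothetical column names, chosen only for illustration.
cols_small = ["a", "b"]
cols_large = ["a", "b", "c"]

# setdiff1d returns the elements of the first argument
# that are missing from the second argument.
only_in_small = np.setdiff1d(cols_small, cols_large)
only_in_large = np.setdiff1d(cols_large, cols_small)

print(only_in_small)  # empty: every name in cols_small is also in cols_large
print(only_in_large)  # contains only "c"
```

Swapping the arguments is enough to surface the extra column.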
If you do not care which of the DataFrames has the unique columns, you can use setxor1d. It returns "the unique values that are in only one (not both) of the input arrays"; see the documentation.
import numpy
colsA = ['a', 'b', 'c', 'd']
colsB = ['b','c']
c = numpy.setxor1d(colsA, colsB)
This returns an array containing 'a' and 'd'.
If you want to use setdiff1d, you need to check for differences both ways:
import numpy as np
# columns in train.columns that are not in train_1.columns
c1 = np.setdiff1d(train.columns, train_1.columns)
# columns in train_1.columns that are not in train.columns
c2 = np.setdiff1d(train_1.columns, train.columns)
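As an alternative that was not part of the original answer, the pandas Index object behind .columns has its own difference and symmetric_difference methods, so the same checks can be done without NumPy. The sketch below uses small hypothetical frames in place of the real train and train_1.

```python
import pandas as pd

# Hypothetical stand-ins for train and train_1 (empty frames with named columns).
train = pd.DataFrame(columns=["a", "b"])
train_1 = pd.DataFrame(columns=["a", "b", "c"])

# One-directional differences, equivalent in spirit to the two setdiff1d calls.
only_in_train = train.columns.difference(train_1.columns)
only_in_train_1 = train_1.columns.difference(train.columns)

# Columns that appear in exactly one of the two frames (like setxor1d).
in_exactly_one = train.columns.symmetric_difference(train_1.columns)

print(list(only_in_train))
print(list(only_in_train_1))
print(list(in_exactly_one))
```

Both methods return an Index rather than a NumPy array, which can be passed straight back into DataFrame column selection.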
Use something like this:
data_3 = data1[~data1.isin(data2)]
where data1 and data2 are columns and data_3 is the set difference data1 - data2, that is, the values of data1 that do not appear in data2.
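A minimal runnable sketch of this approach, assuming data1 and data2 are pandas Series (the values here are made up for illustration):

```python
import pandas as pd

# Hypothetical example values.
data1 = pd.Series(["a", "b", "c", "d"])
data2 = pd.Series(["b", "c"])

# Keep only the entries of data1 whose values do not occur in data2.
data_3 = data1[~data1.isin(data2)]

print(data_3.tolist())  # ['a', 'd']
```

Unlike setdiff1d, this keeps the original index labels and any duplicate values in data1, and it is still one-directional.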