Compare 2 dataframes for equality in R

I have 2 dataframes with the same 2 columns. I want to check whether the datasets are identical. The original datasets have some 700K records, but I'm trying to figure out a way to do it using dummy datasets.

I tried using compare, identical, all, all.equal, etc. None of them returns TRUE.

The dummy datasets are:

a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)

all(a==c)
[1] FALSE

compare(a,c)
FALSE [FALSE, FALSE]

identical(a,c)
[1] FALSE

all.equal(a,c)
[1] "Component “x”: Mean relative difference: 0.9090909" "Component “b”: Mean relative difference: 0.3225806"

The datasets are entirely the same, except for the order of the records. If these functions only work when the datasets are exact copies of each other, row for row, then I must try something else. If that is the case, can someone help me get TRUE for these 2 (unordered) datasets?

dplyr's setdiff works on data frames, so I would suggest:

library(dplyr)
nrow(setdiff(a, c)) == 0 & nrow(setdiff(c, a)) == 0
# [1] TRUE

Note that this will not account for the number of duplicate rows (i.e., if a has multiple copies of a row and c has only one copy of that row, it will still return TRUE). Not sure how you want duplicate rows handled...
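To make the duplicate-row caveat concrete, here is a small sketch. The data frames a_dup and c_one are made up for illustration and are not from the question; the check itself is the same setdiff test as above (written with && since both sides are single values).

```r
library(dplyr)

# Hypothetical data: a_dup repeats the row (1, 5); c_one has it only once.
a_dup <- data.frame(x = c(1, 1, 2), b = c(5, 5, 6))
c_one <- data.frame(x = c(1, 2), b = c(5, 6))

# Same two-way set-difference check as above.
same_set <- nrow(setdiff(a_dup, c_one)) == 0 && nrow(setdiff(c_one, a_dup)) == 0
same_set
# [1] TRUE

# The row counts differ even though the check passes.
c(nrow(a_dup), nrow(c_one))
# [1] 3 2
```

So the check treats the data frames as sets of distinct rows, which may or may not be what you want.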

If you do care about having the same number of duplicates, then I would suggest two possibilities: (a) adding an ID column to differentiate the duplicates and using the approach above, or (b) sorting, resetting the row names (annoyingly), and using identical.

(a) adding an ID column

library(dplyr)
a_id = group_by_all(a) %>% mutate(id = row_number())
c_id = group_by_all(c) %>% mutate(id = row_number())
nrow(setdiff(a_id, c_id)) == 0 & nrow(setdiff(c_id, a_id)) == 0
# [1] TRUE

(b) sorting

a_sort = a[do.call(order, a), ]
row.names(a_sort) = NULL
c_sort = c[do.call(order, c), ]
row.names(c_sort) = NULL
identical(a_sort, c_sort)
# [1] TRUE
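If you need the sorting approach in more than one place, it could be wrapped up roughly like this. This is only a sketch: the name same_rows and the up-front shape check are my own additions, not something from a package.

```r
# Hypothetical helper: TRUE when x and y hold the same rows with the same
# multiplicities, regardless of row order. Uses base R only.
same_rows <- function(x, y) {
  # Different dimensions or column names can never match.
  if (!identical(dim(x), dim(y)) || !identical(names(x), names(y))) {
    return(FALSE)
  }
  sort_rows <- function(d) {
    # unname() keeps column names from being matched against order()'s own
    # arguments (for example a column called "decreasing").
    d <- d[do.call(order, unname(as.list(d))), , drop = FALSE]
    row.names(d) <- NULL  # reset row names so identical() ignores the old order
    d
  }
  identical(sort_rows(x), sort_rows(y))
}

a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)
same_rows(a, c)
# [1] TRUE
```

Because it relies on identical, it is strict about column types (integer vs. double); swap in isTRUE(all.equal(...)) if you want a tolerance instead.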

Maybe a function to sort the columns before comparison is what you need. But it will be slow on large dataframes.

unordered_equal <- function(X, Y, exact = FALSE){
  X[] <- lapply(X, sort)
  Y[] <- lapply(Y, sort)
  if(exact) identical(X, Y) else all.equal(X, Y)
}

unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] TRUE

a$x <- a$x + .Machine$double.eps
unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] FALSE
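One thing to keep in mind with sorting each column on its own: it checks that every column holds the same values, but it no longer checks which values sit together in a row. A small sketch with made-up data frames p and q (not from the question) shows the difference; the paste-based row key is just one simple way to compare whole rows and is an assumption of this example.

```r
# Hypothetical data: same values per column, but paired differently.
p <- data.frame(x = 1:2, b = 1:2)  # rows: (1, 1), (2, 2)
q <- data.frame(x = 1:2, b = 2:1)  # rows: (1, 2), (2, 1)

# Column-wise sorting, as in unordered_equal() above.
col_sorted <- function(d) { d[] <- lapply(d, sort); d }
columns_match <- identical(col_sorted(p), col_sorted(q))
columns_match
# [1] TRUE

# Row-wise comparison: build one key per row, then compare the sorted keys.
rows_match <- identical(sort(do.call(paste, p)), sort(do.call(paste, q)))
rows_match
# [1] FALSE
```

If the rows have to match as units, a row-wise sort (or the setdiff approach) is the safer choice.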

Basically what you want may be to compare the ordered underlying matrices.

all.equal(matrix(unlist(a[order(a[1]), ]), dim(a)),
          matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE
identical(matrix(unlist(a[order(a[1]), ]), dim(a)),
          matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE

You could wrap this into a function for more convenience:

om <- function(d) matrix(unlist(d[order(d[1]), ]), dim(d))

all.equal(om(a), om(c))
# [1] TRUE
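Note that om() orders by the first column only, so rows that tie on that column keep whatever relative order they had. A possible variant that orders by every column is sketched below; om_all is a hypothetical name, and like om() it coerces all columns to one common type, so it is best suited to all-numeric data.

```r
# Hypothetical variant of om(): order by all columns so ties are broken
# consistently, then flatten to a plain matrix.
om_all <- function(d) {
  d <- d[do.call(order, unname(as.list(d))), , drop = FALSE]
  matrix(unlist(d, use.names = FALSE), nrow = nrow(d), ncol = ncol(d))
}

# Made-up example with ties in the first column.
d1 <- data.frame(x = c(1, 1), b = c(1, 2))
d2 <- data.frame(x = c(1, 1), b = c(2, 1))
identical(om_all(d1), om_all(d2))
# [1] TRUE
```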

You can use the new package called waldo:

library(waldo)
a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)

compare(a,c)

And you get:

`old$x`: 1 2 3 4 5 6 7 8 9 10 and 9 more...
`new$x`:                   10           ...

`old$b`: 20 19 18 17 16 15 14 13 12 11 and 9 more...
`new$b`: 
