简体   繁体   中英

Compare 2 dataframes for equality in R

I have 2 dataframes with 2 same columns. I want to check if the datasets are identical. The original datasets have some 700K records but I'm trying to figure out a way to do it using dummy datasets

I tried using compare, identical, all, all_equal etc. None of them returns me True.

The dummy datasets are -

a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)

all(a==c)
[1] FALSE

compare(a,c)
FALSE [FALSE, FALSE]

identical(a,c)
[1] FALSE

 all.equal(a,c)
[1] "Component “x”: Mean relative difference: 0.9090909" "Component “b”: Mean relative difference: 0.3225806"

The datasets are entirely same, except for the order of the records. If these functions only work when the datasets are mirror images of each other, then I must try something else. If that is the case, can someone help with how do I get True for these 2 datasets (unordered)

dplyr 's setdiff works on data frames, I would suggest

library(dplyr)
nrow(setdiff(a, c)) == 0 & nrow(setdiff(c, a)) == 0
# [1] TRUE

Note that this will not account for number of duplicate rows . (ie, if a has multiple copies of a row, and c has only one copy of that row, it will still return TRUE ). Not sure how you want duplicate rows handled...

If you do care about having the same number of duplicates, then I would suggest two possibilities: (a) adding an ID column to differentiate the duplicates and using the approach above, or (b) sorting, resetting the row names (annoyingly), and using identical .

(a) adding an ID column

library(dplyr)
a_id = group_by_all(a) %>% mutate(id = row_number())
c_id = group_by_all(c) %>% mutate(id = row_number())
nrow(setdiff(a_id, c_id)) == 0 & nrow(setdiff(c_id, a_id)) == 0
# [1] TRUE

(b) sorting

a_sort = a[do.call(order, a), ]
row.names(a_sort) = NULL
c_sort = c[do.call(order, c), ]
row.names(c_sort) = NULL
identical(a_sort, c_sort)
# [1] TRUE

Maybe a function to sort the columns before comparison is what you need. But it will be slow on large dataframes.

unordered_equal <- function(X, Y, exact = FALSE){
  X[] <- lapply(X, sort)
  Y[] <- lapply(Y, sort)
  if(exact) identical(X, Y) else all.equal(X, Y)
}

unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] TRUE

a$x <- a$x + .Machine$double.eps
unordered_equal(a, c)
#[1] TRUE
unordered_equal(a, c, TRUE)
#[1] FALSE

Basically what you want may be to compare the ordered underlying matrices.

all.equal(matrix(unlist(a[order(a[1]), ]), dim(a)),
          matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE
identical(matrix(unlist(a[order(a[1]), ]), dim(a)),
          matrix(unlist(c[order(c[1]), ]), dim(c)))
# [1] TRUE

You could wrap this into a function for more convenience:

om <- function(d) matrix(unlist(d[order(d[1]), ]), dim(d))

all.equal(om(a), om(c))
# [1] TRUE

You can use the new package called waldo

library(waldo)
a <- data.frame(x = 1:10, b = 20:11)
c <- data.frame(x = 10:1, b = 11:20)

compare(a,c)

And you get:

`old$x`: 1 2 3 4 5 6 7 8 9 10 and 9 more...
`new$x`:                   10           ...

`old$b`: 20 19 18 17 16 15 14 13 12 11 and 9 more...
`new$b`: 

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM