
R: how to check if all columns in a data.frame are the same

> df = data.frame(A = c(1, 2, 3), B = c(3, 2, 2), C = c(3, 2, 1)); df
  A B C
1 1 3 3
2 2 2 2
3 3 2 1
> df2 = data.frame(A = c(1, 2, 3), B = c(1, 2, 3), C = c(1, 2, 3)); df2
  A B C
1 1 1 1
2 2 2 2
3 3 3 3

I want to know whether all the columns in my data.frame are the same. For df, the answer should be FALSE, whereas for df2 it should be TRUE.

You could check if the number of unique variable vectors is equal to one:

length(unique(as.list(df))) == 1
# [1] FALSE
length(unique(as.list(df2))) == 1
# [1] TRUE

Another way could be to check if each variable is identical to the first variable:

all(sapply(df, identical, df[,1]))
# [1] FALSE
all(sapply(df2, identical, df2[,1]))
# [1] TRUE
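
One caveat with identical() is that it compares storage type as well as values, so an integer column and a double column holding the same numbers do not count as the same. The sketch below illustrates this on a made-up data frame (df_mixed and df_num are hypothetical names, not from the question) and shows one possible workaround, assuming all columns are numeric:

```r
# Hypothetical example data: same values, different storage types.
df_mixed <- data.frame(A = c(1, 2, 3), B = c(1L, 2L, 3L))

# identical() treats the double column A and the integer column B as different.
strict <- all(sapply(df_mixed, identical, df_mixed[, 1]))

# Coercing every column to double first makes the check compare values only.
df_num <- as.data.frame(lapply(df_mixed, as.numeric))
loose <- all(sapply(df_num, identical, df_num[, 1]))

print(strict)  # [1] FALSE
print(loose)   # [1] TRUE
```

Whether the strict or the loose behaviour is the right one depends on what "the same" should mean for your data.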

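You can also compare each column with the one before it using all.equal, which allows for small numeric differences. This returns one result per pair of adjacent columns; wrap the call in all() to get a single TRUE or FALSE.

sapply(2:ncol(df), function(x) isTRUE(all.equal(df[, x-1], df[, x])))
# [1] FALSE FALSE

sapply(2:ncol(df2), function(x) isTRUE(all.equal(df2[, x-1], df2[, x])))
# [1] TRUE TRUE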
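
To show where all.equal and identical disagree, here is a minimal sketch that wraps the all.equal idea in a helper returning a single value. The name all_cols_equal and the data frame df_fp are hypothetical, and the helper assumes the data frame has at least one column:

```r
# Hypothetical helper: TRUE when every column matches the first one
# within all.equal()'s default numeric tolerance.
all_cols_equal <- function(d) {
  first <- d[[1]]
  all(vapply(d, function(col) isTRUE(all.equal(first, col)), logical(1)))
}

# Hypothetical data: B differs from A only by floating-point rounding
# (0.1 + 0.2 is not exactly 0.3 in double precision).
df_fp <- data.frame(A = c(0.3, 0.6), B = c(0.1 + 0.2, 0.3 + 0.3))

tolerant <- all_cols_equal(df_fp)
exact <- all(sapply(df_fp, identical, df_fp[, 1]))

print(tolerant)  # [1] TRUE
print(exact)     # [1] FALSE
```

Comparing against the first column rather than adjacent pairs keeps small differences from accumulating across many columns.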

Here is a new handy update to this relatively old question:

You can use the function all_equal from the dplyr package. It returns TRUE if two data frames are identical, and otherwise a character vector describing why they are not equal. Note that it compares two data frames with each other, not the columns within a single data frame, and that it has been deprecated in recent dplyr releases (1.1.0 and later).

More information: https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/all_equal

It is perhaps worth mentioning the speed difference between the two solutions by josliber. The length(unique(...)) solution is faster with small data, while all(sapply(...)) is faster with large data.

df = data.frame(A = c(1, 2, 3), B = c(3, 2, 2), C = c(3, 2, 1))
df2 = data.frame(A = c(1, 2, 3), B = c(1, 2, 3), C = c(1, 2, 3))
# enlarge:
# df = do.call("rbind", replicate(10000, df, simplify = FALSE))
# df2 = do.call("rbind", replicate(10000, df2, simplify = FALSE))

microbenchmark::microbenchmark(
    uniq1 =
        {
            length(unique(as.list(df))) == 1
        },
    uniq2 =
        {
            length(unique(as.list(df2))) == 1
        },
    ident1 =
        {
            all(sapply(df, identical, df[,1]))
        },
    ident2 =
        {
            all(sapply(df2, identical, df2[,1]))
        }
)

# small:
Unit: microseconds
   expr    min      lq     mean  median      uq     max neval cld
  uniq1  4.243  4.5975  5.41435  5.0620  5.3685  19.852   100  a 
  uniq2  4.337  4.6425  5.80585  5.1340  5.3920  31.652   100  a 
 ident1 24.476 25.0100 28.22507 25.4255 26.4865 157.661   100   b
 ident2 24.558 25.0380 28.08906 25.5215 26.6605  76.284   100   b

# large:
Unit: microseconds
   expr     min       lq      mean   median       uq     max neval  cld
  uniq1 529.882 531.1020 537.98098 532.9360 538.0695 628.057   100   c 
  uniq2 872.855 874.7085 893.56305 884.1715 903.2400 987.257   100    d
 ident1  25.004  26.2735  29.68082  27.7770  29.1075  55.286   100 a   
 ident2 369.629 371.1610 379.34730 372.6670 379.2495 455.276   100  b 
