
How to efficiently count unique (numeric) column vectors of a data.table?

foo <- data.table(x = 1:5/sum(1:5),
                  y = (-4):0/sum((-4):0),
                 z1 = 2:6/sum(2:6),
                 z2 = 2:6/sum(2:6))

Suppose I have the foo data.table (as specified above):

            x   y   z1   z2
1: 0.06666667 0.4 0.10 0.10
2: 0.13333333 0.3 0.15 0.15
3: 0.20000000 0.2 0.20 0.20
4: 0.26666667 0.1 0.25 0.25
5: 0.33333333 0.0 0.30 0.30

How can I efficiently count the unique columns? In this case there are only 3.

Please assume that in general:

  1. foo is always a data.table and not a matrix, though the columns are always numeric.
  2. foo in reality is big: nrow > 20k and ncol > 100.

Is it possible to do this without making extra copies of the data?

My current approach is to apply paste() over the columns to get a single string per column, and then do length(unique(.)) on the result...
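For reference, a minimal sketch of what that approach looks like:

# collapse each column into one long string, then count the distinct strings
length(unique(sapply(foo, paste, collapse = ",")))
# [1] 3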

Is there any magic with data.table::transpose(), data.table::uniqueN, and maybe some other friends?

Another possibility:

length(unique(as.list(foo)))

Which gives the expected result:

> length(unique(as.list(foo)))
[1] 3

NOTE: the use of length(unique()) is necessary, as uniqueN() will return an error here.

Per the comment of @Ryan, you can also do:

length(unique.default(foo))
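Both calls treat the data.table as what it is underneath, a list of column vectors, so they should agree; a quick illustrative check with the foo from the question:

length(unique.default(foo)) == length(unique(as.list(foo)))
# [1] TRUE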

With regard to speed, both methods are comparable (when measured on a larger dataset of 5M rows):

> fooLarge <- foo[rep(1:nrow(foo), 1e6)]
> microbenchmark(length(unique.default(fooLarge)), length(unique(as.list(fooLarge))))
Unit: milliseconds
                              expr     min       lq     mean   median       uq       max neval cld
  length(unique.default(fooLarge)) 94.0433 94.56920 95.24076 95.01492 95.67131 103.15433   100   a
 length(unique(as.list(fooLarge))) 94.0254 94.68187 95.17648 95.02672 95.49857  99.19411   100   a

If you want to retain only the unique columns, you could use:

# option 1
cols <- !duplicated(as.list(foo))
foo[, ..cols]

# option 2 (doesn't retain the column names)
as.data.table(unique.default(foo))

which gives (output option 1 shown):

            x   y   z1
1: 0.06666667 0.4 0.10
2: 0.13333333 0.3 0.15
3: 0.20000000 0.2 0.20
4: 0.26666667 0.1 0.25
5: 0.33333333 0.0 0.30
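If the column names matter for option 2, one way to restore them is to reuse the duplicated() filter from option 1, since unique.default() keeps the first occurrence of each column in order; a sketch:

res <- as.data.table(unique.default(foo))
setnames(res, names(foo)[!duplicated(as.list(foo))])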

Transpose and check for non-duplicates:

ncol(foo[, which(!duplicated(t(foo))), with = FALSE])

3
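Note that t(foo) coerces the data.table to a matrix, which makes a full copy of the data. The same count can be obtained without building that matrix by comparing the columns as list elements, e.g.:

sum(!duplicated(as.list(foo)))
# [1] 3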

Another method which may be faster if you expect a large number of duplicates:

n_unique_cols <- function(foo) {
  # K holds the indices of the columns still considered unique
  K <- seq_along(foo)
  for (j in seq_along(foo)) {
    if (j %in% K) {
      foo_j <- .subset2(foo, j)  # fast column access without copying
      for (k in K) {
        if (j < k) {
          foo_k <- .subset2(foo, k)
          # cheap first-element check before the full identical() comparison
          if (foo_j[1] == foo_k[1] && identical(foo_j, foo_k)) {
            K <- K[K != k]  # column k duplicates column j; drop it
          }
          rm(foo_k)
        }
      }
    }
  }
  length(K)
}
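Used on the foo from the question, this gives the same count as the other approaches:

n_unique_cols(foo)
# [1] 3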

Timings:

library(data.table)
# Build a data.table with `col` extra columns derived from x and y.
# Because the added offsets are drawn from small ranges, many of the
# derived columns end up as exact duplicates of one another.
create_foo <- function(row, col) {
  foo <- data.table(x = rnorm(row), 
                    y = seq_len(row) - 2L)

  set.seed(1)
  for (k in seq_len(col %/% 2L)) {
    foo[, (paste0('x', k)) := x + sample(-4:4, size = 1)]
    foo[, (paste0('y', k)) := y + sample(-2:2, size = 1)]
  }
  foo
}

library(bench)
res <- 
  press(rows = c(1e5, 1e6, 1e7), 
        cols = c(10, 50, 100), 
        {

          foorc <- create_foo(rows, cols)
          bench::mark(n_unique_cols(foorc), 
                      length(unique(as.list(foorc))))
        })
plot(res)

For this family of data, this function is twice as fast, but its memory consumption grows faster than that of unique(as.list(.)).

