
How to efficiently count unique (numeric) column vectors of a data.table?

foo <- data.table(x = 1:5/sum(1:5),
                  y = (-4):0/sum((-4):0),
                 z1 = 2:6/sum(2:6),
                 z2 = 2:6/sum(2:6))

Suppose I have the foo data table (as specified above):

            x   y   z1   z2
1: 0.06666667 0.4 0.10 0.10
2: 0.13333333 0.3 0.15 0.15
3: 0.20000000 0.2 0.20 0.20
4: 0.26666667 0.1 0.25 0.25
5: 0.33333333 0.0 0.30 0.30

How can I efficiently count the unique columns? In this case there are only 3.

Please assume that in general:

  1. foo is always a data.table and not a matrix, though the columns are always numeric.
  2. foo in reality is big: nrow > 20k and ncol > 100.

Is it possible to do this without making extra copies of the data?

My current approach is to apply over the columns with paste to collapse each column into a single value, and then call length(unique(.)) on the result...
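A minimal sketch of this paste-based approach (the `"\r"` separator is an assumption, chosen to make collisions between different columns unlikely):

```r
library(data.table)

foo <- data.table(x = 1:5/sum(1:5),
                  y = (-4):0/sum((-4):0),
                  z1 = 2:6/sum(2:6),
                  z2 = 2:6/sum(2:6))

# Collapse each column into a single string, then count distinct strings.
# Note: this copies every column into a character vector.
keys <- vapply(foo, function(col) paste(col, collapse = "\r"), character(1))
length(unique(keys))  # 3
```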

Is there any magic with data.table::transpose(), data.table::uniqueN, and maybe some other friends?

Another possibility:

length(unique(as.list(foo)))

Which gives the expected result:

> length(unique(as.list(foo)))
[1] 3

NOTE: the use of length(unique()) is necessary, as uniqueN() will return an error.

Per the comment of @Ryan, you can also do:

length(unique.default(foo))

With regard to speed, both methods are comparable (when measured on a larger dataset of 5M rows):

> fooLarge <- foo[rep(1:nrow(foo), 1e6)]
> microbenchmark(length(unique.default(fooLarge)),
+                length(unique(as.list(fooLarge))))
Unit: milliseconds
                              expr     min       lq     mean   median       uq       max neval cld
  length(unique.default(fooLarge)) 94.0433 94.56920 95.24076 95.01492 95.67131 103.15433   100   a
 length(unique(as.list(fooLarge))) 94.0254 94.68187 95.17648 95.02672 95.49857  99.19411   100   a

If you want to retain only the unique columns, you could use:

# option 1
cols <- !duplicated(as.list(foo))
foo[, ..cols]

# option 2 (doesn't retain the column names)
as.data.table(unique.default(foo))

which gives (output of option 1 shown):

            x   y   z1
1: 0.06666667 0.4 0.10
2: 0.13333333 0.3 0.15
3: 0.20000000 0.2 0.20
4: 0.26666667 0.1 0.25
5: 0.33333333 0.0 0.30

Transpose and check for non-duplicates:

ncol( foo[ , which( !duplicated( t( foo ) ) ), with = FALSE ])

3
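One caveat with this approach: t(foo) coerces the data.table to a transposed matrix, so it makes a full copy of the data. A copy-friendlier sketch along the lines of option 1 above compares the columns directly as a list:

```r
library(data.table)

foo <- data.table(x = 1:5/sum(1:5),
                  y = (-4):0/sum((-4):0),
                  z1 = 2:6/sum(2:6),
                  z2 = 2:6/sum(2:6))

# duplicated() on the list of columns compares the vectors element-wise
# without ever materializing a transposed matrix:
sum(!duplicated(as.list(foo)))  # 3
```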

Another method which may be faster if you expect a large number of duplicates:

n_unique_cols <- function(foo) {
  K <- seq_along(foo)
  for (j in seq_along(foo)) {
    if (j %in% K) {
      foo_j <- .subset2(foo, j)
      for (k in K) {
        if (j < k) {
          foo_k <- .subset2(foo, k)
          if (foo_j[1] == foo_k[1] && identical(foo_j, foo_k)) {
            K <- K[K != k]
          }
          rm(foo_k)
        }
      }
    }
  }
  length(K)
}
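Applied to the example foo from the question, the function gives the expected count (the definition is repeated here only so the snippet runs standalone):

```r
library(data.table)

# n_unique_cols() as defined above, repeated for a self-contained example
n_unique_cols <- function(foo) {
  K <- seq_along(foo)
  for (j in seq_along(foo)) {
    if (j %in% K) {
      foo_j <- .subset2(foo, j)
      for (k in K) {
        if (j < k) {
          foo_k <- .subset2(foo, k)
          if (foo_j[1] == foo_k[1] && identical(foo_j, foo_k)) {
            K <- K[K != k]
          }
          rm(foo_k)
        }
      }
    }
  }
  length(K)
}

foo <- data.table(x = 1:5/sum(1:5),
                  y = (-4):0/sum((-4):0),
                  z1 = 2:6/sum(2:6),
                  z2 = 2:6/sum(2:6))

n_unique_cols(foo)  # 3
```

The cheap `foo_j[1] == foo_k[1]` check lets the function skip the full `identical()` comparison for most non-duplicate pairs, which is where the speedup on duplicate-heavy data comes from.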

Timings:

library(data.table)
create_foo <- function(row, col) {
  foo <- data.table(x = rnorm(row), 
                    y = seq_len(row) - 2L)

  set.seed(1)
  for (k in seq_len(col %/% 2L)) {
    foo[, (paste0('x', k)) := x + sample(-4:4, size = 1)]
    foo[, (paste0('y', k)) := y + sample(-2:2, size = 1)]
  }
  foo
}

library(bench)
res <- 
  press(rows = c(1e5, 1e6, 1e7), 
        cols = c(10, 50, 100), 
        {

          foorc <- create_foo(rows, cols)
          bench::mark(n_unique_cols(foorc), 
                      length(unique(as.list(foorc))))
        })
plot(res)

For this family of data, this function is about twice as fast, but its memory consumption grows faster than that of unique(as.list(.)).

[benchmark plot from plot(res) omitted]
