
How to efficiently count unique (numeric) column vectors of a data.table?

foo <- data.table(x = 1:5/sum(1:5),
                  y = (-4):0/sum((-4):0),
                 z1 = 2:6/sum(2:6),
                 z2 = 2:6/sum(2:6))

Suppose I have the foo data table (as specified above):

            x   y   z1   z2
1: 0.06666667 0.4 0.10 0.10
2: 0.13333333 0.3 0.15 0.15
3: 0.20000000 0.2 0.20 0.20
4: 0.26666667 0.1 0.25 0.25
5: 0.33333333 0.0 0.30 0.30

How can I efficiently count the unique columns? In this case there are only 3.

Please assume that in general:

  1. foo is always a data.table and not a matrix, though the columns are always numeric.
  2. foo in reality is big: nrow > 20k and ncol > 100.

Is it possible to do this without making extra copies of the data?

My current approach is to apply over the columns with paste to collapse each column into a single value, and then call length(unique(.)) on the result...
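A minimal sketch of this paste-based approach (the `"\r"` separator is an assumption, chosen to make collisions between different columns unlikely):

```r
library(data.table)

foo <- data.table(x = 1:5/sum(1:5),
                  y = (-4):0/sum((-4):0),
                  z1 = 2:6/sum(2:6),
                  z2 = 2:6/sum(2:6))

# Collapse each column into a single string, then count distinct strings.
# Note: this copies every column into a character vector.
keys <- vapply(foo, function(col) paste(col, collapse = "\r"), character(1))
length(unique(keys))  # 3
```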

Is there any magic with data.table::transpose(), data.table::uniqueN, and maybe some other friends?

Another possibility:

length(unique(as.list(foo)))

Which gives the expected result:

> length(unique(as.list(foo)))
[1] 3

NOTE: the use of length(unique()) is necessary, as uniqueN() will return an error.

Per the comment of @Ryan, you can also do:

length(unique.default(foo))

With regard to speed, both methods are comparable (when measured on a larger dataset of 5M rows):

> fooLarge <- foo[rep(1:nrow(foo), 1e6)]
> microbenchmark(length(unique.default(fooLarge)),
+                length(unique(as.list(fooLarge))))
Unit: milliseconds
                              expr     min       lq     mean   median       uq       max neval cld
  length(unique.default(fooLarge)) 94.0433 94.56920 95.24076 95.01492 95.67131 103.15433   100   a
 length(unique(as.list(fooLarge))) 94.0254 94.68187 95.17648 95.02672 95.49857  99.19411   100   a

If you want to retain only the unique columns, you could use:

# option 1
cols <- !duplicated(as.list(foo))
foo[, ..cols]

# option 2 (doesn't retain the column names)
as.data.table(unique.default(foo))

which gives (output of option 1 shown):

            x   y   z1
1: 0.06666667 0.4 0.10
2: 0.13333333 0.3 0.15
3: 0.20000000 0.2 0.20
4: 0.26666667 0.1 0.25
5: 0.33333333 0.0 0.30

Transpose and check for non-duplicates:

ncol( foo[ , which( !duplicated( t( foo ) ) ), with = FALSE ])

3
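One caveat with this approach: t(foo) coerces the data.table to a transposed matrix, so it makes a full copy of the data. A copy-friendlier sketch along the lines of option 1 above compares the columns directly as a list:

```r
library(data.table)

foo <- data.table(x = 1:5/sum(1:5),
                  y = (-4):0/sum((-4):0),
                  z1 = 2:6/sum(2:6),
                  z2 = 2:6/sum(2:6))

# duplicated() on the list of columns compares the vectors element-wise
# without ever materializing a transposed matrix:
sum(!duplicated(as.list(foo)))  # 3
```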

Another method which may be faster if you expect a large number of duplicates:

n_unique_cols <- function(foo) {
  K <- seq_along(foo)
  for (j in seq_along(foo)) {
    if (j %in% K) {
      foo_j <- .subset2(foo, j)
      for (k in K) {
        if (j < k) {
          foo_k <- .subset2(foo, k)
          if (foo_j[1] == foo_k[1] && identical(foo_j, foo_k)) {
            K <- K[K != k]
          }
          rm(foo_k)
        }
      }
    }
  }
  length(K)
}
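Applied to the example foo from the question, the function gives the expected count (the definition is repeated here only so the snippet runs standalone):

```r
library(data.table)

# n_unique_cols() as defined above, repeated for a self-contained example
n_unique_cols <- function(foo) {
  K <- seq_along(foo)
  for (j in seq_along(foo)) {
    if (j %in% K) {
      foo_j <- .subset2(foo, j)
      for (k in K) {
        if (j < k) {
          foo_k <- .subset2(foo, k)
          if (foo_j[1] == foo_k[1] && identical(foo_j, foo_k)) {
            K <- K[K != k]
          }
          rm(foo_k)
        }
      }
    }
  }
  length(K)
}

foo <- data.table(x = 1:5/sum(1:5),
                  y = (-4):0/sum((-4):0),
                  z1 = 2:6/sum(2:6),
                  z2 = 2:6/sum(2:6))

n_unique_cols(foo)  # 3
```

The cheap `foo_j[1] == foo_k[1]` check lets the function skip the full `identical()` comparison for most non-duplicate pairs, which is where the speedup on duplicate-heavy data comes from.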

Timings:

library(data.table)
create_foo <- function(row, col) {
  foo <- data.table(x = rnorm(row), 
                    y = seq_len(row) - 2L)

  set.seed(1)
  for (k in seq_len(col %/% 2L)) {
    foo[, (paste0('x', k)) := x + sample(-4:4, size = 1)]
    foo[, (paste0('y', k)) := y + sample(-2:2, size = 1)]
  }
  foo
}

library(bench)
res <- 
  press(rows = c(1e5, 1e6, 1e7), 
        cols = c(10, 50, 100), 
        {

          foorc <- create_foo(rows, cols)
          bench::mark(n_unique_cols(foorc), 
                      length(unique(as.list(foorc))))
        })
plot(res)

For this family of data, this function is about twice as fast, but its memory consumption grows faster than that of unique(as.list(.)).

[benchmark plot from plot(res) omitted]
