
Select data.table values based on a vector of column indexes

How can I select values from a data.table based on a vector of column indexes?

I have an integer vector of the same length as the number of rows in a data.table:

set.seed(100)
col.indexes <- sample(c(1:4), 150, replace = TRUE)

How can I create a vector of values based on it? For example, the result of the following, but without the for loop:

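library(data.table)

iris <- as.data.table(iris)  # copy the built-in dataset into a data.table
res <- c()
for(i in 1:150) {
  # pick, for row i, the single column named by col.indexes[i]
  res[i] <- iris[i, .SD, .SDcols = col.indexes[i]]
}
res <- unlist(res)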

This is loosely based on this question: How to subset the next column in R

We can group by the sequence of rows and extract the value for each row:

res <- iris[, col1 := col.indexes][, .SD[[col1[1]]], 1:nrow(iris)]$V1
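
To see what the grouped call does, here is a minimal sketch on a toy table. The names `dt`, `col_idx` and `picked` are made up for illustration and are not part of the answer above. Each group holds exactly one row, so `col1[1]` is that row's column index and `.SD[[...]]` pulls the matching column.

```r
library(data.table)

# Toy table (hypothetical data, not from the question)
dt <- data.table(a = c(1, 2, 3), b = c(10, 20, 30))
col_idx <- c(2L, 1L, 2L)

# Store the column index next to each row, by reference
dt[, col1 := col_idx]

# One group per row: .SD is a one-row table, col1[1] is that row's index
picked <- dt[, .SD[[col1[1]]], by = seq_len(nrow(dt))]$V1
picked
```

Because this evaluates `j` once per row, it gets slow as the number of rows grows, which matches the benchmarks further down.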

Or, in base R, it can be done in a vectorized way:

iris <- setDF(iris)
iris[1:4][cbind(seq_len(nrow(iris)), col.indexes)]
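
The base approach relies on matrix indexing: a two-column matrix of (row, column) pairs selects one element per pair. A small sketch with a toy data.frame, where `df`, `col_idx`, `idx` and `picked` are illustrative names only:

```r
# Toy data.frame (hypothetical data, not from the question)
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(100, 200, 300))
col_idx <- c(3L, 1L, 2L)

# Each row of idx is a (row, column) pair: (1,3), (2,1), (3,2)
idx <- cbind(seq_len(nrow(df)), col_idx)

# Indexing with a two-column matrix returns one element per pair
picked <- as.matrix(df)[idx]
picked
```

The explicit `as.matrix()` is only there to make the coercion visible; indexing the data.frame directly with `idx`, as in the answer, should give the same values when all columns are numeric.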

Here's a more complicated answer using melt and a join. Using a data.frame is better for this:

library(data.table)
dt <- as.data.table(iris)

dt[, ID := .I]
dt[, Species := NULL]

melt(dt, id.vars = 'ID'
     )[, variable := as.integer(variable)
       ][data.frame(col.indexes, ID = seq_len(150))
         , on = .(ID, variable = col.indexes)
         , value
         ]
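
The same melt-and-join idea can be broken into steps on a toy table, which may make the chain above easier to follow. This is a sketch under assumptions: `dt`, `col_idx`, `long`, `lookup` and `picked` are hypothetical names, and the lookup table is built as a data.table rather than a data.frame.

```r
library(data.table)

# Toy table (hypothetical data, not from the question)
dt <- data.table(a = c(1, 2, 3), b = c(10, 20, 30))
col_idx <- c(2L, 1L, 2L)

# Step 1: add a row ID and reshape to long form (one row per ID/column pair)
dt[, ID := .I]
long <- melt(dt, id.vars = "ID")

# Step 2: the factor 'variable' follows column order, so its integer
# codes are the original column positions
long[, variable := as.integer(variable)]

# Step 3: join on (ID, column position) to pick one value per row
lookup <- data.table(ID = seq_len(nrow(dt)), variable = col_idx)
picked <- long[lookup, on = .(ID, variable), value]
picked
```

The join returns values in the order of `lookup`, so the result lines up with the original rows.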

In a benchmark on the 150-row data, @akrun's base method performs very well:

# A tibble: 7 x 13
  expression           min   median `itr/sec` mem_alloc
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 akrun_dt          4.49ms   4.78ms     190.   107.84KB
2 akrun_base         122us  127.4us    7575.     8.44KB
3 cole_melt         3.99ms   4.24ms     233.   271.41KB
4 Pavo_diag         3.32ms   3.45ms     283.   449.44KB
5 OP_loop          83.08ms  84.03ms      11.9    4.86MB
6 OP_loop_dt_mod    1.32ms   1.36ms     712.    13.76KB
7 OP_loop_mat_mod  373.9us  389.2us    2472.    17.17KB

I also ran it with 1E5 rows, per @Bulat's comment. I got an error with @PavoDive's method, so I excluded it.

# A tibble: 7 x 13
  expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 akrun_dt           2.19s    2.19s    0.456     6.58MB    1.37      1     3
2 akrun_base        2.73ms   2.88ms   58.5       8.79MB    7.63     46     6
3 cole_melt        30.53ms  33.63ms   29.6      17.79MB    1.97     15     1
4 OP_loop            1.07m    1.07m    0.0156    3.16GB    0.810     1    52
5 OP_loop_df_mod  991.45ms 991.45ms    1.01       3.3MB    2.02      1     2
6 OP_loop_dt_mod     1.07s    1.07s    0.930      3.3MB    1.86      1     2
7 OP_loop_mat_mod 218.95ms 235.98ms    4.29      4.58MB    1.43      3     1

Then I upped it to 1E7 rows:

# A tibble: 2 x 13
  expression   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 akrun_base 2.21s  2.21s     0.452   877.4MB    0.904     1     2      2.21s
2 cole_melt  4.88s  4.88s     0.205    1.71GB    0.410     1     2      4.88s

Complete code for the benchmarks:

library(data.table)

set.seed(100)
ind <- 1E5

col.indexes <- sample(c(1:4), ind, replace = TRUE)
dt1 <- as.data.table(iris[sample(nrow(iris), ind, replace = T), ])

bench::mark(
  akrun_dt = {
    dt <- copy(dt1)
    dt[, col1 := col.indexes][, .SD[[col1[1]]], 1:nrow(dt)]$V1
  }
  ,
  akrun_base = {
    DF <- copy(dt1)
    setDF(DF)
    DF[1:4][cbind(seq_len(nrow(DF)), col.indexes)]
  }
  ,
  cole_melt = {
    dt <- copy(dt1)
    dt[, ID := .I]
    dt[, Species := NULL]

    melt(dt, id.vars = 'ID'
    )[, variable := as.integer(variable)
      ][data.frame(col.indexes, ID = seq_len(ind))
        , on = .(ID, variable = col.indexes)
        , value
        ]
  }
  # ,Pavo_diag = {
  #   diag(as.matrix(dt1[, .SD, .SDcols = col.indexes]))
  # }
  ,
  OP_loop = {
    res <- c()

    for(i in seq_len(ind)) {
      res[i] <- dt1[i, .SD, .SDcols = col.indexes[i]]
    }
    unlist(res)
  }
  ,
  OP_loop_df_mod = {
    sapply(seq_len(ind), function(i) DF[[col.indexes[i]]][i])
  }
  ,
  OP_loop_dt_mod = {
    sapply(seq_len(ind), function(i) dt1[[col.indexes[i]]][i])
  }
  ,
  OP_loop_mat_mod = {
    mat <- as.matrix(DF[1:4])
    colnames(mat) <- NULL
    unlist(lapply(seq_len(ind), function(i) mat[i, col.indexes[i]]), use.names = F)
  }
)

I see another option, which builds a 150 x 150 matrix (one column per entry of col.indexes) and takes its diagonal:

res2 <- diag(as.matrix(iris[, .SD, .SDcols = col.indexes]))
all.equal(res2, res)

[1] TRUE
