Select data.table values based on a vector of column indexes
I have an integer vector of the same length as the number of rows in a data.table:
set.seed(100)
col.indexes <- sample(c(1:4), 150, replace = TRUE)
How can I create a vector of values based on it, i.e. the result of the following code, but without the for loop?
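library(data.table)
# as.data.table() copies; setDT(iris) can fail on the built-in dataset because its binding is locked
iris <- as.data.table(iris)
res <- c()
for(i in 1:150) {
  # row i, restricted to the single column col.indexes[i] (a one-element list)
  res[i] <- iris[i, .SD, .SDcols = col.indexes[i]]
}
res <- unlist(res)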
This is loosely based on this question: How to subset the next column in R
We can group by the sequence of rows and extract the values:
res <- iris[, col1 := col.indexes][, .SD[[col1[1]]], 1:nrow(iris)]$V1
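To see what the grouped call does, here is a minimal sketch on a toy table. The names `toy`, `cols` and `picked` are made up for illustration and are not from the original post. Each row becomes its own group, so `.SD` is a one-row table and `col1[1]` is the column position to pull from it.

```r
library(data.table)

# Toy data (hypothetical): three numeric columns, three rows
toy <- data.table(a = c(1, 2, 3), b = c(10, 20, 30), c = c(100, 200, 300))
cols <- c(2L, 1L, 3L)  # column position wanted for rows 1, 2, 3

# Store the wanted position next to each row
toy[, col1 := cols]

# One group per row: .SD is that single row, .SD[[k]] is its k-th column
picked <- toy[, .SD[[col1[1]]], by = seq_len(nrow(toy))]$V1
picked  # 10 2 300
```

Because `j` is evaluated once per row, this approach is expected to slow down noticeably as the number of rows grows.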
Or in base R, it can be done in a vectorized way:
iris <- setDF(iris)
iris[1:4][cbind(seq_len(nrow(iris)), col.indexes)]
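The base R line works because indexing with a two-column matrix treats every row of that matrix as a (row, column) pair. A small sketch with made-up data (`df`, `cols`, `idx` and `picked` are illustrative names only):

```r
# Toy data frame (hypothetical): numeric columns only
df <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(100, 200, 300))
cols <- c(2L, 1L, 3L)  # column position wanted for rows 1, 2, 3

# Each row of idx is one (row, column) pair: (1,2), (2,1), (3,3)
idx <- cbind(seq_len(nrow(df)), cols)

# Matrix indexing returns one element per pair, with no explicit loop
picked <- df[idx]
picked  # 10 2 300
```

Restricting to `iris[1:4]` in the answer keeps only the numeric columns; with the factor column `Species` included, the data frame would presumably be coerced to a character matrix before indexing.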
Here's a complicated answer using melt and a join. Using a data.frame is better for this:
library(data.table)
dt <- as.data.table(iris)
dt[, ID := .I]
dt[, Species := NULL]
melt(dt, id.vars = 'ID'
)[, variable := as.integer(variable)
][data.frame(col.indexes, ID = seq_len(150))
, on = .(ID, variable = col.indexes)
, value
]
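The same melt-and-join idea, unrolled step by step on a toy table, may be easier to follow. This is only a sketch: `toy`, `cols`, `long`, `lookup` and `picked` are hypothetical names, and the lookup table is a data.table here rather than the data.frame used above.

```r
library(data.table)

# Toy data (hypothetical): three numeric columns, three rows
toy <- data.table(a = c(1, 2, 3), b = c(10, 20, 30), c = c(100, 200, 300))
cols <- c(2L, 1L, 3L)  # column position wanted for rows 1, 2, 3

toy[, ID := .I]                      # remember the original row number
long <- melt(toy, id.vars = "ID")    # one row per (ID, variable, value)

# 'variable' is a factor whose levels follow the column order,
# so as.integer() turns it into the column position
long[, variable := as.integer(variable)]

# One lookup row per original row: which column position to fetch
lookup <- data.table(ID = seq_len(nrow(toy)), variable = cols)

# The join returns rows in the order of 'lookup'; keep only 'value'
picked <- long[lookup, on = .(ID, variable), value]
picked  # 10 2 300
```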
Here's @akrun's base method doing awesome:
# A tibble: 7 x 13
  expression           min   median `itr/sec` mem_alloc
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 akrun_dt          4.49ms   4.78ms     190.   107.84KB
2 akrun_base         122us  127.4us    7575.     8.44KB
3 cole_melt         3.99ms   4.24ms     233.   271.41KB
4 Pavo_diag         3.32ms   3.45ms     283.   449.44KB
5 OP_loop          83.08ms  84.03ms      11.9    4.86MB
6 OP_loop_dt_mod    1.32ms   1.36ms     712.    13.76KB
7 OP_loop_mat_mod  373.9us  389.2us    2472.    17.17KB
I also did 1E5 rows per @Bulat's comment. I got an error with @PavoDive's method, so I excluded it.
# A tibble: 7 x 13
  expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 akrun_dt           2.19s    2.19s    0.456     6.58MB    1.37      1     3
2 akrun_base        2.73ms   2.88ms   58.5       8.79MB    7.63     46     6
3 cole_melt        30.53ms  33.63ms   29.6      17.79MB    1.97     15     1
4 OP_loop            1.07m    1.07m    0.0156    3.16GB    0.810     1    52
5 OP_loop_df_mod  991.45ms 991.45ms    1.01       3.3MB    2.02      1     2
6 OP_loop_dt_mod     1.07s    1.07s    0.930      3.3MB    1.86      1     2
7 OP_loop_mat_mod 218.95ms 235.98ms    4.29      4.58MB    1.43      3     1
Then I upped it to 1E7 rows:
# A tibble: 2 x 13
  expression   min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 akrun_base 2.21s  2.21s     0.452   877.4MB    0.904     1     2      2.21s
2 cole_melt  4.88s  4.88s     0.205    1.71GB    0.410     1     2      4.88s
Complete code for benchmarks:
library(data.table)
set.seed(100)
ind <- 1E5
col.indexes <- sample(c(1:4), ind, replace = TRUE)
# resample iris rows with replacement to get a table of `ind` rows
dt1 <- as.data.table(iris[sample(nrow(iris), ind, replace = TRUE), ])
bench::mark(
  akrun_dt = {
    dt <- copy(dt1)
    dt[, col1 := col.indexes][, .SD[[col1[1]]], 1:nrow(dt)]$V1
  }
  ,
  akrun_base = {
    DF <- copy(dt1)
    setDF(DF)
    DF[1:4][cbind(seq_len(nrow(DF)), col.indexes)]
  }
  ,
  cole_melt = {
    dt <- copy(dt1)
    dt[, ID := .I]
    dt[, Species := NULL]
    melt(dt, id.vars = 'ID'
         )[, variable := as.integer(variable)
           ][data.frame(col.indexes, ID = seq_len(ind))
             , on = .(ID, variable = col.indexes)
             , value
             ]
  }
  # excluded: raised an error at 1E5 rows
  # ,Pavo_diag = {
  #   diag(as.matrix(dt1[, .SD, .SDcols = col.indexes]))
  # }
  ,
  OP_loop = {
    res <- c()
    for(i in seq_len(ind)) {
      res[i] <- dt1[i, .SD, .SDcols = col.indexes[i]]
    }
    unlist(res)
  }
  ,
  OP_loop_df_mod = {
    # uses the DF object assigned in the akrun_base expression above
    sapply(seq_len(ind), function(i) DF[[col.indexes[i]]][i])
  }
  ,
  OP_loop_dt_mod = {
    sapply(seq_len(ind), function(i) dt1[[col.indexes[i]]][i])
  }
  ,
  OP_loop_mat_mod = {
    # uses the DF object assigned in the akrun_base expression above
    mat <- as.matrix(DF[1:4])
    colnames(mat) <- NULL
    unlist(lapply(seq_len(ind), function(i) mat[i, col.indexes[i]]), use.names = FALSE)
  }
)
I see another option:
res2 <- diag(as.matrix(iris[, .SD, .SDcols = col.indexes]))
all.equal(res2, res)
[1] TRUE
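A short sketch of why the diag approach returns the right values, using a made-up matrix (`m`, `cols`, `wide` and `picked` are illustrative names). Selecting one column per row produces an n x n matrix whose diagonal holds the wanted elements:

```r
# Toy matrix (hypothetical): 3 rows, 3 columns
m <- cbind(a = c(1, 2, 3), b = c(10, 20, 30), c = c(100, 200, 300))
cols <- c(2L, 1L, 3L)  # column position wanted for rows 1, 2, 3

# One selected column per row: column j of 'wide' is m[, cols[j]]
wide <- m[, cols]
dim(wide)  # 3 3

# Element (i, i) of 'wide' is m[i, cols[i]]
picked <- diag(wide)
picked  # 10 2 300
```

Note that the intermediate matrix has n * n cells. For 1E5 rows that would be 1E10 cells, roughly 80 GB as doubles, which may be why this method errored in the larger benchmark above.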