Subset a data.table by the third column of a multi-column key

Question

Suppose I have a data.table with a 3-column key. For example, say we have time nested within students, nested within schools.

dt <- data.table(expand.grid(schools = 200:210, students = 1:100, time = 1:5),
                 key = c("schools", "students", "time"))

Say I want to subset my data to include only time 5. I know I can use subset:

time.5 <- subset(dt, time == 5)

Or I could do a vector scan:

time.5 <- dt[time == 5]

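But those aren't the "data.table way": I want to take advantage of the speed of binary search. Since my key has 3 columns, using unique as follows produces incorrect results: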

dt[.(unique(schools), unique(students), 5)]

Any ideas?

Answer 1

You could try:

 setkey(dt, time)
 dt[J(5)]

 all( dt[J(5)][,time]==5)
 #[1] TRUE

Benchmarks

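library(data.table)
library(microbenchmark)

dt1 <- data.table(expand.grid(schools = 200:450, students = 1:600, time = 1:50),
                  key = c('schools', 'students', 'time'))

# f1: vector scan
f1 <- function() dt1[time == 5]

# f2: re-key on time, binary search, then put the three-column key on the result
f2 <- function() {
  setkey(dt1, time)
  new.dt <- dt1[J(5)]
  setkeyv(new.dt, colnames(dt1))
}

# f3: re-key on time, then binary search
f3 <- function() {
  setkey(dt1, time)
  dt1[J(5)]
}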


microbenchmark(f1(), f2(), f3(), unit='relative', times=20L)
#Unit: relative
#expr      min       lq     mean   median       uq      max neval cld
#f1() 3.188559 3.240377 3.342936 3.218387 3.224352 5.319811    20   b
#f2() 1.050202 1.083136 1.081707 1.089292 1.087572 1.129741    20  a 
#f3() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    20  a
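If re-keying the table is undesirable, a cross join is one alternative worth sketching. This is not part of the benchmark above, and the toy table is rebuilt here so the snippet stands alone. The idea is that `CJ()` enumerates every (school, student, 5) combination, so the lookup can still use the existing three-column key. The attempt in the question passes vectors of different lengths to `.()`, which does not enumerate those combinations.

```r
library(data.table)

# Toy data from the question: time nested within students within schools
dt <- data.table(expand.grid(schools = 200:210, students = 1:100, time = 1:5),
                 key = c("schools", "students", "time"))

# CJ() returns the sorted cross product of its arguments, one column per key column.
# nomatch = 0L drops combinations that do not occur in dt (none in this toy data).
time.5 <- dt[CJ(unique(dt$schools), unique(dt$students), 5L), nomatch = 0L]
```

The cost grows with the number of school/student combinations, so on large tables this may well be slower than re-keying on `time`; it is worth timing on your own data.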

Answer 2

If query performance is the main concern, you can speed up @akrun's solution even further.

# install_github("jangorecki/dwtools")
# or just source: https://github.com/jangorecki/dwtools/blob/master/R/idxv.R
library(dwtools)
# instead of single key you can define multiple to be used automatically without the need to re-setkey
Idx = list(
  c('schools', 'students', 'time'),
  c('time')
)
IDX <- idxv(dt1, Idx)
f4 <- function(){
  dt1[CJI(IDX,TRUE,TRUE,5)]
}
microbenchmark(f4(), f1(), f2(), f3(), unit='relative', times=1L)
#Unit: relative
#expr       min        lq      mean    median        uq       max neval
#f4()  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000     1
#f1()  6.431114  6.431114  6.431114  6.431114  6.431114  6.431114     1
#f2()  2.320577  2.320577  2.320577  2.320577  2.320577  2.320577     1
#f3() 23.706655 23.706655 23.706655 23.706655 23.706655 23.706655     1

Correct me if I'm wrong, but it seems that f3() reuses its key across iterations when microbenchmark is run with times > 1L.
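The key reuse can be checked on a small table. The sketch below uses a reduced, made-up table rather than `dt1`, and relies on the fact that `setkey()` modifies its argument by reference: once a function like f3() has run, the table stays keyed on `time`, so later iterations no longer pay for the re-sort.

```r
library(data.table)

# Small stand-in for dt1 (sizes chosen only for illustration)
dt <- data.table(expand.grid(schools = 200:202, students = 1:3, time = 1:5),
                 key = c("schools", "students", "time"))

f3 <- function() {
  setkey(dt, time)   # changes dt itself, not a copy
  dt[J(5)]
}

key_before <- key(dt)   # the three-column key
res <- f3()
key_after <- key(dt)    # now just "time"
```

For a like-for-like timing, each iteration would need to start from a table that still carries the original key, for example by running the function on a `copy()` of it, although the copy itself then adds to the measured time.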

Note that multiple indices (Idx) require a lot of memory.
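For readers on a more recent data.table, the package's own secondary indices may cover the same need without an extra package. The sketch below assumes a version that provides `setindex()` and the `on=` argument (1.9.6 or later, as far as I know). It is not what the benchmark above measured.

```r
library(data.table)

dt <- data.table(expand.grid(schools = 200:210, students = 1:100, time = 1:5),
                 key = c("schools", "students", "time"))

# Build a secondary index on time; the three-column key is left in place
setindex(dt, time)

# Subset on time via on=, without calling setkey() again
time.5 <- dt[.(5L), on = "time"]
```

An index stores one ordering vector per indexed column set, so the memory caveat above applies here too, in proportion to the number of rows.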

Answer 1: score 2, accepted, 2014-12-08 04:14:41
Answer 2: score 0, 2015-01-03 21:05:13