Fast subsetting in data.table in R

Question

Given a data.table, I would like to subset its rows quickly. For example:

dt = data.table(a=1:10, key="a")
dt[a > 3 & a <= 7]

This is still pretty slow. I know I can use joins to get individual rows, but is there a way to exploit the fact that the data.table is sorted to get quick subsets of this kind?

This is what I'm doing:

dt1 = data.table(id = 1, ym = c(199001, 199006, 199009, 199012), last_ym = c(NA, 199001, 199006, 199009), v = 1:4, key=c("id", "ym"))
dt2 = data.table(id = 1, ym = c(199001, 199002, 199003, 199004, 199005, 199006, 199007, 199008, 199009, 199010, 199011, 199012), v2 = 1:12, key=c("id","ym"))

For each id (here there is only 1) and each ym in dt1, I would like to sum the values of v2 between the current ym in dt1 and the last ym in dt1. That is, for ym == 199006 in dt1 I would like to return list(v2 = 2 + 3 + 4 + 5 + 6). These are the values of v2 in dt2 whose ym is less than or equal to the current ym and greater than the previous ym. In code:

expr = expression({
  cur_id = id
  cur_ym = ym
  # rows of dt2 for this id with last_ym < ym <= cur_ym
  cur_dtb = dt2[J(cur_id)][ym <= cur_ym & ym > last_ym]
  setkey(cur_dtb, ym)
  list(r = sum(cur_dtb$v2))
})

dt1[,eval(expr ),by=list(id, ym)]

Answer 1

To avoid the logical condition, perform a rolling join of dt1 and dt2. Then shift ym forward by one position within id. Finally, sum v2 by id and ym:

setkey(dt1, id, last_ym)
setkey(dt2, id, ym)
dt1[dt2,, roll = TRUE][
       , list(v2 = v2, ym = c(last_ym[1], head(ym, -1))), by = id][
       , list(v2 = sum(v2)), by = list(id, ym)]

Note that we want to sum everything since last_ym, so the key on dt1 must be last_ym rather than ym.
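To see what the rolling join is doing, here is a minimal sketch on toy data. The tables `lookup` and `probe` and their columns are hypothetical and not part of the question. With roll = TRUE, each probe value that has no exact match takes the row of the nearest earlier key value.

```r
library(data.table)

# Hypothetical lookup table keyed by t.
lookup <- data.table(t = c(10, 20, 30), label = c("a", "b", "c"), key = "t")

# Values to look up; 15, 25 and 35 have no exact match in lookup.
probe <- data.table(t = c(10, 15, 25, 35))

# roll = TRUE carries the last observation forward:
# 15 picks up the row for 10, 25 the row for 20, 35 the row for 30.
rolled <- lookup[probe, on = "t", roll = TRUE]
print(rolled)
```

In the answer's join the same mechanism maps every ym in dt2 to the most recent last_ym in dt1, which is why dt1 is keyed on last_ym.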

The result is:

   id     ym v2
1:  1 199001  1
2:  1 199006 20
3:  1 199009 24
4:  1 199012 33
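As a sanity check on these numbers, the sums can be recomputed in base R without any join. This is only a sketch: it assumes that a missing last_ym means "no lower bound", which is what the value 1 in the first row implies. The variable names `ym1`, `ym2`, `lower` and `sums` are made up for the illustration.

```r
# Data copied from dt1 and dt2 in the question.
ym1     <- c(199001, 199006, 199009, 199012)
last_ym <- c(NA, 199001, 199006, 199009)
ym2     <- 199001:199012
v2      <- 1:12

# Assumption: an NA last_ym is treated as having no lower bound.
lower <- ifelse(is.na(last_ym), -Inf, last_ym)

# For each row of dt1, sum v2 where last_ym < ym2 <= ym1.
sums <- vapply(
  seq_along(ym1),
  function(i) sum(v2[ym2 > lower[i] & ym2 <= ym1[i]]),
  integer(1)
)
print(sums)
```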

UPDATE: correction

Answer 2

Even though the data.table is sorted, you are limited by the time it takes to evaluate a > 3 & a <= 7 in the first place:

> dt = data.table(a=1:10000000, key="a")
> system.time(dt$a > 3 & dt$a <= 7)
   user  system elapsed 
   0.18    0.01    0.20 
> system.time(dt[,a > 3 & a <= 7])
   user  system elapsed 
   0.18    0.05    0.24 
> system.time(dt[a > 3 & a <= 7])
   user  system elapsed 
   0.25    0.07    0.31

Alternative approach:

> system.time({Indices = dt$a > 3 & dt$a <= 7 ; dt[Indices]})
   user  system elapsed 
   0.28    0.03    0.31
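One way to use the key instead of scanning the whole column, which this answer does not cover, is a keyed join on the values in the range. The sketch below assumes an integer key, so that the range (3, 7] can be written out as 4:7. It would not carry over directly to a numeric key with arbitrary values.

```r
library(data.table)

dt <- data.table(a = 1:10, key = "a")

# Vector scan: the condition is evaluated for every row.
scan_res <- dt[a > 3 & a <= 7]

# Keyed join: binary search on the key for the integer values 4..7.
# nomatch = 0L drops any requested value that is not present in dt.
join_res <- dt[.(4:7), nomatch = 0L]

print(join_res)
```

Both calls return the rows with a equal to 4, 5, 6 and 7. Whether the join is faster in practice depends on the table size and the width of the range, so it is worth timing on your own data.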

Multiple Subsets

Speed can also suffer if you subset one factor level at a time on an ad hoc basis rather than splitting into all groups in a single call:

> dt <- data.table(A=sample(letters, 10000, replace=T))
> system.time(for(i in unique(dt$A)) dt[A==i])
   user  system elapsed 
   5.16    0.42    5.59 
> system.time(dt[,.SD,by=A])
   user  system elapsed 
   0.32    0.03    0.36
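The comparison above only extracts the groups. A small sketch of the same idea with an actual per-group computation follows, here counting rows. The seed and the names `loop_counts` and `by_counts` are assumptions added for the illustration.

```r
library(data.table)

set.seed(1)  # assumption: fixed seed so the run is reproducible
dt <- data.table(A = sample(letters, 10000, replace = TRUE))

# Ad hoc: one subset per level, each scanning the whole table.
levels_A <- sort(unique(dt$A))
loop_counts <- sapply(levels_A, function(i) nrow(dt[A == i]))

# All at once: a single grouped call, sorted by A.
by_counts <- dt[, .N, keyby = A]

# Both approaches give the same counts per level.
print(all(by_counts$N == loop_counts))
```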

Answer 1: 4 (accepted), 2013-07-05 20:02:01
Answer 2: 1, 2013-07-05 18:46:49
