Fast subsetting in data.table in R
Given a data.table, I would like to subset its rows quickly. For example:
dt = data.table(a=1:10, key="a")
dt[a > 3 & a <= 7]
This is still pretty slow. I know I can do joins to get individual rows, but is there a way to exploit the fact that the data.table is sorted to get quick subsets of this kind?
This is what I'm doing:
dt1 = data.table(id = 1, ym = c(199001, 199006, 199009, 199012), last_ym = c(NA, 199001, 199006, 199009), v = 1:4, key=c("id", "ym"))
dt2 = data.table(id = 1, ym = c(199001, 199002, 199003, 199004, 199005, 199006, 199007, 199008, 199009, 199010, 199011, 199012), v2 = 1:12, key=c("id","ym"))
For each id (here there is only 1) and each ym in dt1, I would like to sum the values of v2 in dt2 that fall between the last ym and the current ym in dt1. That is, for ym == 199006 in dt1 I would like to return list(v2 = 2 + 3 + 4 + 5 + 6). These are the values of v2 in dt2 whose ym is less than or equal to the current ym and greater than the previous ym (so the previous ym itself is excluded). In code:
expr = expression({
  cur_id = id
  cur_ym = ym
  # rows of dt2 for this id with last_ym < ym <= cur_ym
  cur_dtb = dt2[J(cur_id)][ym <= cur_ym & ym > last_ym]
  setkey(cur_dtb, ym)
  list(r = sum(cur_dtb$v2))
})
dt1[, eval(expr), by = list(id, ym)]
To avoid the logical condition, perform a rolling join of dt1 and dt2. Then shift ym forward by one position within id. Finally, sum over v2 by id and ym:
setkey(dt1, id, last_ym)
setkey(dt2, id, ym)
dt1[dt2,, roll = TRUE][
, list(v2 = v2, ym = c(last_ym[1], head(ym, -1))), by = id][
, list(v2 = sum(v2)), by = list(id, ym)]
Note that we want to sum everything since last_ym, so the key on dt1 must be last_ym rather than ym.
The result is:
   id     ym v2
1:  1 199001  1
2:  1 199006 20
3:  1 199009 24
4:  1 199012 33
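As a possible alternative to the rolling join, the condition "greater than last_ym and at most ym" can be written directly as a non-equi join. This is only a sketch: it assumes a data.table version that supports non-equi joins in `on=` (1.9.8 or later), and it assumes that a missing last_ym means "no lower bound", which is modelled here by replacing NA with -Inf. The result name `res` and column name `r` are illustrative.

```r
library(data.table)

dt1 <- data.table(id = 1,
                  ym = c(199001, 199006, 199009, 199012),
                  last_ym = c(NA, 199001, 199006, 199009),
                  v = 1:4)
dt2 <- data.table(id = 1, ym = 199000 + 1:12, v2 = 1:12)

# Assumption: a missing last_ym means "no lower bound", so use -Inf.
dt1[is.na(last_ym), last_ym := -Inf]

# For each row of dt1, match the rows of dt2 with last_ym < ym <= current ym
# and sum v2 over the matches (by = .EACHI evaluates j once per row of dt1).
res <- dt2[dt1,
           on = .(id, ym <= ym, ym > last_ym),
           .(r = sum(v2)),
           by = .EACHI]

print(res$r)
```

On the example data this should give the same sums as the table above. Note that the join columns in the result carry the values from dt1, so two columns end up named ym.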
UPDATE: correction
Regardless of the fact that the data.table is sorted, you will be limited by the time it takes to evaluate a > 3 & a <= 7 in the first place:
> dt = data.table(a=1:10000000, key="a")
> system.time(dt$a > 3 & dt$a <= 7)
user system elapsed
0.18 0.01 0.20
> system.time(dt[,a > 3 & a <= 7])
user system elapsed
0.18 0.05 0.24
> system.time(dt[a > 3 & a <= 7])
user system elapsed
0.25 0.07 0.31
Alternative approach:
> system.time({Indices = dt$a > 3 & dt$a <= 7 ; dt[Indices]})
user system elapsed
0.28 0.03 0.31
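The timings above all pay for a full vector scan. Two ways of using the key instead are sketched below, on the small table from the question. Both are assumptions rather than measured improvements: the first only works when the range can be enumerated (an integer key), and the second relies on the column being sorted, which the key guarantees. Whether either is faster will depend on the table size and the data.table version.

```r
library(data.table)

dt <- data.table(a = 1:10, key = "a")

# Option 1: keyed join on the enumerated integer values 4..7.
# nomatch = 0L drops values that are absent from the key.
res_join <- dt[J(4:7), nomatch = 0L]

# Option 2: locate the row range on the sorted column, then slice by position.
# findInterval(x, v) counts the elements of v that are <= x.
lo <- findInterval(3, dt$a) + 1L   # first row with a > 3
hi <- findInterval(7, dt$a)        # last row with a <= 7
res_slice <- if (lo <= hi) dt[lo:hi] else dt[0L]
```

Both results should contain the same rows as dt[a > 3 & a <= 7].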
Multiple Subsets
There can be a speed issue if you subset one factor level at a time on an ad hoc basis instead of splitting the table into all of its groups in a single call:
> dt <- data.table(A=sample(letters, 10000, replace=T))
> system.time(for(i in unique(dt$A)) dt[A==i])
user system elapsed
5.16 0.42 5.59
> system.time(dt[,.SD,by=A])
user system elapsed
0.32 0.03 0.36
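To make the comparison concrete, here is a small sketch of the same idea with a per-group computation (a row count) in place of .SD. The seed and the names `counts_grouped`, `counts_loop` and `pieces` are illustrative, and `split()` with a `by` argument assumes data.table 1.9.8 or later.

```r
library(data.table)

set.seed(1)  # arbitrary seed so the sketch is reproducible
dt <- data.table(A = sample(letters, 10000, replace = TRUE))

# One grouped call: the groups are found once and j is evaluated per group.
counts_grouped <- dt[, .N, by = A]

# Ad hoc equivalent: one separate subset per distinct value of A.
counts_loop <- vapply(unique(dt$A),
                      function(val) nrow(dt[A == val]),
                      integer(1))

# If the subsets themselves are needed, build them all in one call.
pieces <- split(dt, by = "A")
```

The grouped call and the loop should agree on every count; the difference is only in how many times the table is searched.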