
Rolling Subsetting of Data.table for rolling statistics

I have a strange problem that I know I can solve with apply or some other looping structure, but I feel like there should be a really clever way to do this. I have a data.table example_dt from which I extract two id columns to form an id data.table called id_dt.

I then want to use these ids to index into example_dt to compute some statistics. The trick is that the first id, id1, needs to match exactly. The second id, id2, just needs to be within a certain range. I rename the columns in id_dt to avoid naming conflicts. I'm not completely sure what's going on with the scoping in the data.table.

library(data.table)
example_dt <- data.table( id1 = c(rep('a', 7), rep('b', 7)), id2 = c(1:7, 1:7), x1 = c(rep(1:2,7)))
id_dt <- example_dt[,.(id1, id2)]
setnames(id_dt, names(id_dt), c('id1_idx','id2_idx') )
result_dt <- id_dt[,example_dt[id1 == id1_idx & id2 <= id2_idx & id2 >= id2_idx - 2, mean(x1)]]

What I'm getting is just a single value of 1.5:

> result_dt
[1] 1.5

What I want is this:

id1 id2 x1  mean
a   1   1   1
a   2   2   1.5
a   3   1   1.333333333
a   4   2   1.666666667
a   5   1   1.333333333
a   6   2   1.666666667
a   7   1   1.333333333
b   1   2   2
b   2   1   1.5
b   3   2   1.666666667
b   4   1   1.333333333
b   5   2   1.666666667
b   6   1   1.333333333
b   7   2   1.666666667

Like I said, I know I can do it with apply or some other looping structure. I want to see if there is some clever data.table incantation I am not aware of.

Here's one way using rolling joins:

setkey(example_dt, id1, id2)
idx1 = example_dt[.(id1, id2-2), roll=-Inf, which=TRUE]
idx2 = example_dt[.(id1, id2), roll=Inf, which=TRUE]

mapply(function(x,y) mean(example_dt$x1[x:y]), idx1, idx2)
#  [1] 1.000000 1.500000 1.333333 1.666667 1.333333 1.666667 1.333333 2.000000 1.500000
# [10] 1.666667 1.333333 1.666667 1.333333 1.666667

It could also be done using foverlaps(), but that seems like a bit of overkill. If the rolling joins above aren't clear, I suggest you have a look at the roll argument in ?data.table and work through the examples there (until the vignettes for joins are completed). For other vignettes, check the Getting started page. For planned vignettes, have a look at this post.

This has come up quite a few times, so it might be worth making the between() function in data.table capable of performing this (efficiently). I think there's an FR somewhere on the project page.

As to why you get a single value: you are doing DT[rows, mean(col)], which reads: extract col for the rows specified in rows, and compute its mean. That should return a single value. Since there is no by, id1_idx and id2_idx are whole columns here, so the condition is evaluated once, elementwise, rather than once per row of id_dt. Every row satisfies it, and the mean of all of x1 is 1.5.
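For what it's worth, this kind of range match can also be written as a non-equi join. This is only a sketch, and it assumes a data.table release that supports non-equi conditions in on= (1.9.8 or newer, as far as I know). The helper table win and its columns hi and lo are names I made up for illustration.

```r
library(data.table)

example_dt <- data.table(id1 = c(rep('a', 7), rep('b', 7)),
                         id2 = c(1:7, 1:7),
                         x1  = c(rep(1:2, 7)))

# Hypothetical lookup table: one window [lo, hi] per row of example_dt
win <- example_dt[, .(id1, hi = id2, lo = id2 - 2L)]

# For each row of win, match the rows of example_dt with the same id1 and
# lo <= id2 <= hi, then average x1 over those matches (by = .EACHI).
res <- example_dt[win,
                  on = .(id1, id2 <= hi, id2 >= lo),
                  .(mean = mean(x1)),
                  by = .EACHI]

# Attach the result; rows of res follow the row order of win
example_dt[, mean := res$mean]
```

The rows of res come back in the order of win, which is why the result can be assigned straight back onto example_dt.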
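To make that concrete, here is a minimal sketch of the original attempt with a by added, so that j is evaluated once per row of id_dt and id1_idx and id2_idx are single values inside each group. It is still effectively a loop over rows, so I would not expect it to be fast on large data; it is only meant to show the difference in scoping.

```r
library(data.table)

example_dt <- data.table(id1 = c(rep('a', 7), rep('b', 7)),
                         id2 = c(1:7, 1:7),
                         x1  = c(rep(1:2, 7)))
id_dt <- example_dt[, .(id1, id2)]
setnames(id_dt, names(id_dt), c('id1_idx', 'id2_idx'))

# Grouping by both id columns gives one group per row of id_dt, so the
# inner subset of example_dt is recomputed for each (id1_idx, id2_idx) pair.
result_dt <- id_dt[, example_dt[id1 == id1_idx &
                                id2 <= id2_idx &
                                id2 >= id2_idx - 2, mean(x1)],
                   by = .(id1_idx, id2_idx)]
```

The unnamed j result lands in a column called V1, one value per (id1_idx, id2_idx) pair.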
