I have a strange problem that I know I can solve with apply or some other looping structure, but I feel like there should be a really clever way to do this. I have a data.table, example_dt, from which I extract two id columns to form an id data.table called id_dt.
I then want to use these ids to index into example_dt to compute some statistics. The trick is that the first id, id1, needs to match exactly, while the second id, id2, just needs to be within a certain range. I rename the columns in id_dt to avoid naming conflicts. I'm not completely sure what's going on with the scoping in data.table:
library(data.table)
example_dt <- data.table( id1 = c(rep('a', 7), rep('b', 7)), id2 = c(1:7, 1:7), x1 = c(rep(1:2,7)))
id_dt <- example_dt[,.(id1, id2)]
setnames(id_dt, names(id_dt), c('id1_idx','id2_idx') )
result_dt <- id_dt[,example_dt[id1 == id1_idx & id2 <= id2_idx & id2 >= id2_idx - 2, mean(x1)]]
What I'm getting is just a single value of 1.5
> result_dt
[1] 1.5
What I want is this:
id1 id2 x1 mean
a 1 1 1
a 2 2 1.5
a 3 1 1.333333333
a 4 2 1.666666667
a 5 1 1.333333333
a 6 2 1.666666667
a 7 1 1.333333333
b 1 2 2
b 2 1 1.5
b 3 2 1.666666667
b 4 1 1.333333333
b 5 2 1.666666667
b 6 1 1.333333333
b 7 2 1.666666667
Like I said, I know I can do it with apply or some other looping structure. I want to see if there is some clever data.table incantation I am not aware of.
Here's one way using rolling joins:
setkey(example_dt, id1, id2)
idx1 = example_dt[.(id1, id2-2), roll=-Inf, which=TRUE]
idx2 = example_dt[.(id1, id2), roll=Inf, which=TRUE]
mapply(function(x,y) mean(example_dt$x1[x:y]), idx1, idx2)
# [1] 1.000000 1.500000 1.333333 1.666667 1.333333 1.666667 1.333333 2.000000 1.500000
# [10] 1.666667 1.333333 1.666667 1.333333 1.666667
It could also be done using foverlaps(), but that seems like overkill. If this is hard to follow, I suggest having a look at the roll argument in ?data.table and working through the examples there (until the vignettes for joins are completed). For other vignettes, check the Getting started page. For planned vignettes, have a look at this post.
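To make the roll argument concrete, here is a minimal sketch on a toy keyed table (toy_dt is a made-up name, not part of the question's data). It relies on the documented behaviour that roll = Inf carries the last observation forward and roll = -Inf carries the next observation backward, within the group defined by the preceding key columns:

```r
library(data.table)

# toy_dt is a hypothetical table used only to illustrate rolling
toy_dt <- data.table(id1 = c("a", "a", "a", "b"),
                     id2 = c(1L, 2L, 3L, 1L))
setkey(toy_dt, id1, id2)

# No row has id2 == -1 in group "a", so roll = -Inf rolls the next observation backward
first_row <- toy_dt[.("a", -1L), roll = -Inf, which = TRUE]

# An exact match is returned as-is, whatever the roll direction
exact_row <- toy_dt[.("a", 2L), roll = Inf, which = TRUE]

# No row has id2 == 5 in group "a", so roll = Inf carries the last observation forward
last_row <- toy_dt[.("a", 5L), roll = Inf, which = TRUE]

print(c(first_row, exact_row, last_row))
```

In the answer above, idx1 plays the role of first_row (the start of each window) and idx2 the role of exact_row (the end of each window).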
This has come up quite a few times, so it might be worth making the between() function in data.table capable of performing this (efficiently). I think there's an FR somewhere on the project page.
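Later data.table releases (1.9.8 onwards, as far as I know) added non-equi joins through the on argument, which can express this kind of range match directly. A sketch, assuming such a version is installed; bounds, lo, hi and win_mean are names made up for this example:

```r
library(data.table)

example_dt <- data.table(id1 = c(rep("a", 7), rep("b", 7)),
                         id2 = c(1:7, 1:7),
                         x1  = rep(1:2, 7))

# One row per original row, holding the window [id2 - 2, id2]
bounds <- example_dt[, .(id1, lo = id2 - 2L, hi = id2)]

# For each row of bounds, average x1 over the matching rows of example_dt
res <- example_dt[bounds,
                  on = .(id1, id2 >= lo, id2 <= hi),
                  .(win_mean = mean(x1)),
                  by = .EACHI]

# by = .EACHI returns one row per row of bounds, in the same order
example_dt[, win_mean := res$win_mean]
print(example_dt)
```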
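As to why you get a single value: you are doing DT[rows, mean(col)], which reads as: extract col for the rows specified in rows, and compute its mean. That returns a single value. Here id1_idx and id2_idx are whole columns of id_dt, so the comparisons are element-wise against example_dt's own columns, every row passes, and the result is the mean of all of x1, i.e. 1.5.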
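One way to get one mean per row from the question's own expression is to group id_dt by both id columns, so that j runs once per (id1_idx, id2_idx) pair. This is a sketch rather than a recommendation: it assumes every pair is unique (as it is here), and it still subsets example_dt once per row, so it is unlikely to be faster than a loop:

```r
library(data.table)

example_dt <- data.table(id1 = c(rep("a", 7), rep("b", 7)),
                         id2 = c(1:7, 1:7),
                         x1  = rep(1:2, 7))
id_dt <- example_dt[, .(id1_idx = id1, id2_idx = id2)]

# With by =, id1_idx and id2_idx are length-1 inside j,
# so the inner filter picks a different window for each row of id_dt
result_dt <- id_dt[,
                   example_dt[id1 == id1_idx &
                                id2 <= id2_idx &
                                id2 >= id2_idx - 2,
                              mean(x1)],
                   by = .(id1_idx, id2_idx)]
print(result_dt)
```

The means end up in a column named V1, since j returns an unnamed value.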