How to find the last or next entry using R package data.table and rolling joins
Let's say I have a data table like this.
customer_id time_stamp value
1: 1 223 4
2: 1 252 1
3: 1 456 3
4: 2 455 5
5: 2 632 2
Here customer_id and time_stamp together form a unique key. I want to add some new columns indicating the previous and next values of "value". That is, I want output like this.
customer_id time_stamp value value_PREV value_NEXT
1: 1 223 4 NA 1
2: 1 252 1 4 3
3: 1 456 3 1 NA
4: 2 455 5 NA 2
5: 2 632 2 5 NA
I want this to be fast and work with sparse, irregular times. I thought that the data.table rolling join would do it for me. However, the rolling join appears to find the last time OR the same time. So if you do a rolling join on two copies of the same table (after adding _PREV to the column names of the copy), this doesn't quite work. You can fudge it by adding a tiny number to the time variable of the copy, but this is kind of awkward.
Is there a way to do this simply with a rolling join or some other data.table method? I've found an efficient way, but it still requires about 40 lines of R code. It seems that this could be a one-liner if the rolling join could be told to look for the last time NOT including the same time. Or maybe there is some other neat trick.
Here is the example data.
data=data.table(customer_id=c(1,2,1,1,2),time_stamp=c(252,632,456,223,455),value=c(1,2,3,4,5))
data_sorted=data[order(customer_id,time_stamp)]
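The self-match behaviour of a plain rolling self-join can be reproduced on this example data. What follows is only a minimal sketch, assuming a reasonably recent data.table; the names DT, self_roll, i_value and x_value are made up for illustration.

```r
library(data.table)

# Example data from the question
DT <- data.table(customer_id = c(1, 2, 1, 1, 2),
                 time_stamp  = c(252, 632, 456, 223, 455),
                 value       = c(1, 2, 3, 4, 5))
setkey(DT, customer_id, time_stamp)

# Rolling self-join: each row of i is matched to the last row of x
# at or before its time_stamp, within the same customer_id.
# Inside j, `value` is x's (matched) column and `i.value` is i's column.
self_roll <- DT[DT, list(i_value = i.value, x_value = value), roll = TRUE]

# Every row finds itself because the exact time_stamp exists in x,
# so the "previous" value is just the row's own value.
print(self_roll)
```

Because an exact match always wins over rolling, the two columns come out equal, which is why the time variable of the copy has to be nudged before the join.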
This is the code I wrote. Note that the lines putting NA into the rows where customer_id differs throw a warning and probably need changing. I have commented them out below. Does anyone have any suggestions for replacing those two lines?
add_prev_next_cbind<-function(data,ident="customer_id",timecol="time_stamp",prev_tag="PREV",
next_tag="NEXT",sep="_"){
o=order(data[[ident]],data[[timecol]])
uo=order(o)
data=data[o,]
Nrow=nrow(data)
Ncol=ncol(data)
#shift it, put any junk in the first row
data_prev=data[c(1,1:(Nrow-1)),]
#shift it, put any junk in the last row
data_next=data[c(2:(Nrow),Nrow),]
#flag the rows where the identity changes, these get NA
prev_diff=data[[ident]] != data_prev[[ident]]
prev_diff[1]=T
next_diff=data[[ident]] != data_next[[ident]]
next_diff[Nrow]=T
#change names
names=names(data)
names_prev=paste(names,prev_tag,sep=sep)
names_next=paste(names,next_tag,sep=sep)
setnames(data_prev,names,names_prev)
setnames(data_next,names,names_next)
#put NA in rows where prev and next are from a different ident
#replace the next two lines with something else
#data_prev[prev_diff,]<-NA
#data_next[next_diff,]<-NA
data_all=cbind(data,data_prev,data_next)
data_all=data_all[uo,]
return(data_all)
}
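One possible replacement for the two commented-out lines is to blank the flagged rows column by column with set(). This is a sketch only: blank_rows is a hypothetical helper that is not part of the original post or of data.table, and it assumes that assigning a plain NA into an existing column is acceptable for every column type involved.

```r
library(data.table)

# Hypothetical helper (not from the original post): set every column of `dt`
# to NA in the rows flagged by the logical vector `flag`, modifying `dt`
# by reference.
blank_rows <- function(dt, flag) {
  rows <- which(flag)
  for (col in names(dt)) {
    set(dt, i = rows, j = col, value = NA)
  }
  invisible(dt)
}

# Small stand-in for data_prev after the shift and rename steps
demo <- data.table(customer_id_PREV = c(1, 1, 1, 2, 2),
                   value_PREV       = c(4, 4, 1, 3, 5))

# Rows 1 and 4 are where the customer changes, so they get NA
blank_rows(demo, c(TRUE, FALSE, FALSE, TRUE, FALSE))
print(demo)
```

Inside add_prev_next_cbind the calls would then be blank_rows(data_prev, prev_diff) and blank_rows(data_next, next_diff), placed where the commented-out lines are.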
- New function shift() implements fast lead/lag of vector, list, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead" and always returns a list, which makes it very convenient to use it along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.
Now we can therefore do:
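# list(), not c(): in current data.table releases shift() on a single vector
# returns a plain vector, and c() would concatenate the two results
dt[, c("value_PREV", "value_NEXT") := list(shift(value, 1L, type="lag"),
shift(value, 1L, type="lead")), by=customer_id]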
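The (cols) := shift(.SD, ...) form can also lag and lead several columns at once, which is roughly what the 40-line function in the question does. The following is a hedged sketch on the question's example data; the names cols, prev_cols and next_cols are illustrative, and the table is keyed first on the assumption that shift() simply follows row order within each group.

```r
library(data.table)

# Example data from the question
dt <- data.table(customer_id = c(1, 2, 1, 1, 2),
                 time_stamp  = c(252, 632, 456, 223, 455),
                 value       = c(1, 2, 3, 4, 5))

# Sort by customer and time so that "previous" and "next" mean adjacent rows
setkey(dt, customer_id, time_stamp)

cols      <- c("time_stamp", "value")
prev_cols <- paste(cols, "PREV", sep = "_")
next_cols <- paste(cols, "NEXT", sep = "_")

# With .SD as input, shift() returns one list element per column in .SDcols
dt[, (prev_cols) := shift(.SD, 1L, type = "lag"),
   by = customer_id, .SDcols = cols]
dt[, (next_cols) := shift(.SD, 1L, type = "lead"),
   by = customer_id, .SDcols = cols]

print(dt)
```

The first and last row of each customer get NA, so no separate step is needed to blank rows where the identity changes.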
You don't need a roll join here at all. You can do this with head and tail. Assuming your data.table is DT:
# key on time_stamp as well, so rows are in time order within each customer
setkey(DT, customer_id, time_stamp)
DT[, list(time_stamp = time_stamp,
prev.val = c(NA, head(value, -1)),
next.val = c(tail(value, -1), NA)),
by=customer_id]
# customer_id time_stamp prev.val next.val
# 1: 1 223 NA 1
# 2: 1 252 4 3
# 3: 1 456 1 NA
# 4: 2 455 NA 2
# 5: 2 632 5 NA
Edit: Even better:
DT[, `:=`(prev.val = c(NA, head(value, -1)),
next.val = c(tail(value, -1), NA)),
by=customer_id]
Yes, if I don't want roll to equi-match, then I also take a little bit off if it's type double, or work with integer and add or subtract 1L.
DT = data.table( customer_id=c(1,2,1,1,2),
time_stamp=as.integer(c(252,632,456,223,455)),
value=c(1,2,3,4,5))
setkey(DT, customer_id, time_stamp)
DT[ DT[,list(customer_id,time_stamp+1L,value)], value_PREV:=i.value, roll=-Inf]
DT[ DT[,list(customer_id,time_stamp-1L,value)], value_NEXT:=i.value, roll=+Inf]
DT
customer_id time_stamp value value_PREV value_NEXT
1: 1 223 4 NA 1
2: 1 252 1 4 3
3: 1 456 3 1 NA
4: 2 455 5 NA 2
5: 2 632 2 5 NA
Having to take a column subset of DT again in i like that is a bit awkward, I agree.
Have now filed FR#2628 to add a new argument rollequal=TRUE|FALSE. Then it would be:
setkey(DT, customer_id, time_stamp)
DT[ DT, value_PREV:=i.value, roll=-Inf, rollequal=FALSE]
DT[ DT, value_NEXT:=i.value, roll=+Inf, rollequal=FALSE]
That would be faster too by avoiding the copy of the i columns and not needing to allocate for time_stamp-1L and time_stamp+1L.
But in this case, it's a self join from DT to DT and DT's key is unique, so as Arun says, a roll join isn't needed. Maybe a fast shift or lag function is needed to avoid the overhead of c() and head() or tail(), for speed.
Thanks for highlighting!