
How to find the last or next entry using R package data.table and rolling joins

Let's say I have a data table like this.

   customer_id time_stamp value
1:           1        223     4
2:           1        252     1
3:           1        456     3
4:           2        455     5
5:           2        632     2

customer_id and time_stamp together form a unique key. I want to add some new columns holding the previous and next values of "value" for each customer. That is, I want output like this.

   customer_id time_stamp value value_PREV value_NEXT
1:           1        223     4         NA          1
2:           1        252     1          4          3
3:           1        456     3          1         NA
4:           2        455     5         NA          2
5:           2        632     2          5         NA

I want this to be fast and to work with sparse, irregular times. I thought that the data.table rolling join would do it for me. However, the rolling join appears to find the last time OR the same time. So if you do a rolling join on two copies of the same table (after adding _PREV to the column names of the copy), every row simply matches itself, so this doesn't quite work. You can fudge it by adding a tiny number to the time variable of the copy, but this is kind of awkward.
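A minimal sketch of that behaviour, using the example data from this question (the names DT and self_roll are only for illustration). Because every time_stamp in the lookup table has an exact match in the keyed table, the rolling self join returns each row's own value rather than the previous one.

```r
library(data.table)

# Example data from the question, keyed on both columns
DT <- data.table(customer_id = c(1, 2, 1, 1, 2),
                 time_stamp  = c(252, 632, 456, 223, 455),
                 value       = c(1, 2, 3, 4, 5))
setkey(DT, customer_id, time_stamp)

# Rolling self join: exact matches take precedence over rolling,
# so i.value (from the lookup copy) equals value on every row
self_roll <- DT[DT, roll = TRUE]
same_value <- identical(self_roll$value, self_roll$i.value)
print(same_value)
```

This is why the copy's time variable has to be nudged before a rolling join can return the strictly earlier row.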

Is there a way to do this simply with a rolling join or some other data.table method? I've found an efficient way, but it still requires about 40 lines of R code. It seems that this could be a one-liner if the rolling join could be told to look for the last time NOT including the same time. Or maybe there is some other neat trick.

Here is the example data.

library(data.table)
data=data.table(customer_id=c(1,2,1,1,2),time_stamp=c(252,632,456,223,455),value=c(1,2,3,4,5))
data_sorted=data[order(customer_id,time_stamp)]

This is the code I wrote. Note that the lines putting NA into the rows where customer_id differs throw a warning and probably need changing. I have commented them out below. Does anyone have any suggestions for replacing those two lines?

add_prev_next_cbind<-function(data,ident="customer_id",timecol="time_stamp",prev_tag="PREV",
                   next_tag="NEXT",sep="_"){
  o=order(data[[ident]],data[[timecol]])
  uo=order(o)
  data=data[o,]
  Nrow=nrow(data)
  Ncol=ncol(data)
  #shift it, put any junk in the first row
  data_prev=data[c(1,1:(Nrow-1)),]
  #shift it, put any junk in the last row
  data_next=data[c(2:(Nrow),Nrow),]
  #flag the rows where the identity changes, these get NA
  prev_diff=data[[ident]] != data_prev[[ident]]
  prev_diff[1]=T
  next_diff=data[[ident]] != data_next[[ident]]  
  next_diff[Nrow]=T
  #change names
  names=names(data)
  names_prev=paste(names,prev_tag,sep=sep)
  names_next=paste(names,next_tag,sep=sep)
  setnames(data_prev,names,names_prev)
  setnames(data_next,names,names_next)
  #put NA in rows where prev and next are from a different ident
  #replace the next two lines with something else
  #data_prev[prev_diff,]<-NA
  #data_next[next_diff,]<-NA
  data_all=cbind(data,data_prev,data_next)
  data_all=data_all[uo,]
  return(data_all)
}

Update: #965 is now implemented in 1.9.5. From NEWS:

  1. New function shift() implements fast lead/lag of vector, list, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead" and always returns a list, which makes it very convenient to use it along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.

Now we can therefore do:

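# list() rather than c(): in current data.table versions shift() on a single
# vector returns a plain vector, so c() would glue the two results together
dt[, c("value_PREV", "value_NEXT") := list(shift(value, 1L, type="lag"), 
                     shift(value, 1L, type="lead")), by=customer_id]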
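For a complete run on the example data from the question, here is a minimal sketch. It assumes a data.table version that provides shift() (1.9.5 or later), and it sorts the table first, because shift() works on row order within each group and the example data is not sorted by time. The name DT is only illustrative.

```r
library(data.table)

# Example data from the question (unsorted on purpose)
DT <- data.table(customer_id = c(1, 2, 1, 1, 2),
                 time_stamp  = c(252, 632, 456, 223, 455),
                 value       = c(1, 2, 3, 4, 5))

# shift() looks at neighbouring rows, so order by time within customer first
setorder(DT, customer_id, time_stamp)

# Add lagged and led values per customer, by reference
DT[, `:=`(value_PREV = shift(value, 1L, type = "lag"),
          value_NEXT = shift(value, 1L, type = "lead")),
   by = customer_id]
print(DT)
```

The first row of each customer gets NA in value_PREV and the last row gets NA in value_NEXT, matching the output requested in the question.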

You don't need a roll join here at all. You can do this with head and tail. Assuming your data.table is DT:

setkey(DT, customer_id, time_stamp)  # also sort by time within each customer
DT[, list(time_stamp = time_stamp, 
          prev.val = c(NA, head(value, -1)), 
          next.val = c(tail(value, -1), NA)), 
by=customer_id]
#   customer_id time_stamp prev.val next.val
# 1:           1        223       NA        1
# 2:           1        252        4        3
# 3:           1        456        1       NA
# 4:           2        455       NA        2
# 5:           2        632        5       NA

Edit: Even better:

DT[, `:=`(prev.val = c(NA, head(value, -1)), 
          next.val = c(tail(value, -1), NA)), 
          by=customer_id]

Yes, if I don't want roll to equi-match, then I also take a little bit off if the column is type double, or work with integers and add or subtract 1L.

DT = data.table( customer_id=c(1,2,1,1,2), 
                 time_stamp=as.integer(c(252,632,456,223,455)),
                 value=c(1,2,3,4,5))
setkey(DT, customer_id, time_stamp)
DT[ DT[,list(customer_id,time_stamp+1L,value)], value_PREV:=i.value, roll=-Inf]
DT[ DT[,list(customer_id,time_stamp-1L,value)], value_NEXT:=i.value, roll=+Inf]
DT
   customer_id time_stamp value value_PREV value_NEXT
1:           1        223     4         NA          1
2:           1        252     1          4          3
3:           1        456     3          1         NA
4:           2        455     5         NA          2
5:           2        632     2          5         NA

Having to take a column subset of DT again in i like that is a bit awkward, I agree.

I have now filed FR#2628 to add a new argument rollequal=TRUE|FALSE. Then it would be:

setkey(DT, customer_id, time_stamp)
DT[ DT, value_PREV:=i.value, roll=-Inf, rollequal=FALSE]
DT[ DT, value_NEXT:=i.value, roll=+Inf, rollequal=FALSE]

That would be faster too, by avoiding the copy of the i columns and not needing to allocate for time_stamp-1L and time_stamp+1L.

But in this case, it's a self join from DT to DT and DT's key is unique, so as Arun says, a roll join isn't needed. Maybe a fast shift or lag function is needed to avoid the overhead of c() and head() or tail(), for speed.
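As a rough check that such a function can stand in for the c()/head()/tail() idiom, the sketch below compares the two on a small vector, using the shift() function that the update above reports as added in data.table 1.9.5. The vector v and the result names are made up for illustration, and no timing claim is made here.

```r
library(data.table)

v <- c(4, 1, 3)  # values of one customer, already in time order

# Base R idiom from the head/tail answer
prev_base <- c(NA, head(v, -1))
next_base <- c(tail(v, -1), NA)

# Same result with data.table's shift()
prev_shift <- shift(v, 1L, type = "lag")
next_shift <- shift(v, 1L, type = "lead")

print(all.equal(prev_base, prev_shift))
print(all.equal(next_base, next_shift))
```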

Thanks for highlighting!
