
R: New column creation using lag values from other columns & many other conditions with data.table

Below is sample data; the last column is the desired output.

data <- structure(list(
  engagement_date = structure(c(16939, 16939, 16939, 16939, 16939, 16939, 16939, 16939),
                              class = "Date"),
  driver_id = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "69", class = "factor"),
  session_id = structure(1:8, .Label = c("16525506", "16526272", "16527063", "16531156",
                                         "16532064", "16533490", "16541432", "16547653",
                                         "16548040", "16553477", "16558000"), class = "factor"),
  status = structure(c(3L, 2L, 3L, 4L, 1L, 3L, 1L, 2L), .Label = c("3", "4", "6", "7"),
                     class = "factor"),
  req_made_time = structure(c(1463556140, 1463556681, 1463557268, 1463560083,
                              1463560796, 1463562026, 1463568316, 1463572256),
                            class = c("POSIXct", "POSIXt"), tzone = ""),
  ride_drop_time = structure(c(NA, NA, NA, NA, 1463561749, NA, 1463569532, NA),
                             class = c("POSIXct", "POSIXt"), tzone = ""),
  cmplt_flag = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("0", "1"),
                         class = "factor"),
  req_no = 1:8,
  last_req_diff = c(0, 9, 9.8, 46.9, 11.9, 20.5, 104.8, 65.7),
  last_ride_diff = c(720, 729, 738.8, 785.7, 797.6, 4.6, 109.4, 45.4)
), .Names = c("engagement_date", "driver_id", "session_id", "status", "req_made_time",
              "ride_drop_time", "cmplt_flag", "req_no", "last_req_diff", "last_ride_diff"),
row.names = c(NA, 8L), class = "data.frame")

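The last column in the sample dataset is the desired column; I produced it with an Excel formula on a subset. I can also get the last column with the code below, but it takes a very long time because the data is huge. The code should help you understand the conditions involved.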

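data1 <- as.data.frame(data)
len <- length(data1$driver_id) + 1
seq <- 1
while (seq < len) {
  # first request: 720; previous request completed: minutes since its drop time;
  # otherwise: previous last_ride_diff plus the gap since the previous request
  data1$last_ride_diff[seq] <- ifelse(data1$req_no[seq] > 1,
    ifelse(data1$cmplt_flag[seq - 1] == 1,
      as.numeric(difftime(data1$req_made_time[seq], data1$ride_drop_time[seq - 1], units = "mins")),
      data1$last_ride_diff[seq - 1] + data1$last_req_diff[seq]),
    720)
  seq <- seq + 1  # move on to the next row
}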
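As a worked check of the two non-trivial branches, the sketch below redoes the arithmetic for requests 6 and 7 by hand. The epoch seconds are copied from the sample data; the variable names are made up for this illustration.

```r
# Epoch seconds copied from rows 5-7 of the sample data
drop_5 <- 1463561749  # ride_drop_time of request 5 (completed)
req_6  <- 1463562026  # req_made_time of request 6
req_7  <- 1463568316  # req_made_time of request 7

# Request 6: the previous request completed, so measure from its drop time
diff_6 <- (req_6 - drop_5) / 60

# Request 7: the previous request (6) did not complete, so add the gap
# between the two requests to the previous value
diff_7 <- diff_6 + (req_7 - req_6) / 60

print(c(diff_6, diff_7))
```

Both values agree with the sample's 4.6 and 109.4 up to the rounding used in the Excel column.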

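Please suggest a faster way to get the desired values, possibly with data.table or any other alternative. Since the dataset contains many driver_ids, I need the result computed separately for each driver_id.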

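Here is one possible approach. It needs the handy na.locf function from the zoo package, which fills NAs with the most recent non-NA value (in either direction).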

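library(data.table)
library(zoo)

# Convert to data.table and drop the wanted column (the last one)
dataT <- as.data.table(data)[, -length(data), with = FALSE]

# Carry each driver's most recent drop time forward over the NA rows
dataT[, drop_time_fill := na.locf(ride_drop_time, na.rm = FALSE), by = driver_id]

dataout <- dataT[,
  .(ride_drop_time,
    drop_time_fill,
    # minutes since the previous request; 0 for a driver's first request
    last_req_dif = ifelse(is.na(shift(req_made_time)), 0,
                          difftime(req_made_time, shift(req_made_time), units = "mins")),
    # minutes since the last completed ride; NA until a ride has completed
    last_ride_diff = difftime(req_made_time, shift(drop_time_fill, 1), units = "mins")
  ),
  by = driver_id]

# Before any completed ride: start at 720 and accumulate the request gaps.
# last_req_dif is unrounded, so these sums can differ in the decimals from
# ones built on the rounded last_req_diff column of the sample.
dataout[is.na(last_ride_diff),
        last_ride_diff := 720 + cumsum(last_req_dif),
        by = driver_id]

dataout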

   driver_id      ride_drop_time      drop_time_fill last_req_dif  last_ride_diff
1:        69                <NA>                <NA>     0.000000 720.000000 mins
2:        69                <NA>                <NA>     9.016667 729.000000 mins
3:        69                <NA>                <NA>     9.783333 738.800000 mins
4:        69                <NA>                <NA>    46.916667 785.700000 mins
5:        69 2016-05-18 10:55:49 2016-05-18 10:55:49    11.883333 797.600000 mins
6:        69                <NA> 2016-05-18 10:55:49    20.500000   4.616667 mins
7:        69 2016-05-18 13:05:32 2016-05-18 13:05:32   104.833333 109.450000 mins
8:        69                <NA> 2016-05-18 13:05:32    65.666667  45.400000 mins
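To see the forward-fill step in isolation, here is a minimal sketch of na.locf on a made-up numeric vector (the values are hypothetical, not taken from the question's data).

```r
library(zoo)

# Hypothetical vector: NA stands for "no completed ride on this row"
x <- c(NA, NA, 5, NA, 7, NA)

# Carry the last non-NA value forward; na.rm = FALSE keeps the leading NAs
# so the result has the same length as the input
filled <- na.locf(x, na.rm = FALSE)
print(filled)
# NA NA  5  5  7  7
```

Keeping na.rm = FALSE matters inside a grouped data.table assignment, because the filled vector has to line up row for row with the group. Passing fromLast = TRUE fills in the other direction.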

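It took me a while to figure out, but it was an interesting problem. Note that this assumes there is only one record with last_req_dif = 0 (the first one, I guess).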

I could not test it against your full data.
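One way to gain some confidence before running on the full data is to try the same idea on a tiny table with more than one driver. The sketch below is a simplified restatement, not the exact code above. It assumes hypothetical columns req_min and drop_min holding minutes since midnight, and it replaces the cumulative sum with the equivalent "720 plus time since the driver's first request".

```r
library(data.table)
library(zoo)

# Hypothetical two-driver data; times are minutes since midnight
dt <- data.table(
  driver_id = c("A", "A", "A", "B", "B"),
  req_min   = c(600, 610, 640, 700, 715),
  drop_min  = c(NA, 630, NA, NA, NA)
)

# Forward-fill each driver's last drop time, then look one row back
dt[, drop_fill := na.locf(drop_min, na.rm = FALSE), by = driver_id]
dt[, prev_drop := shift(drop_fill), by = driver_id]

# Minutes since the last completed ride; before any completed ride,
# 720 plus the time elapsed since the driver's first request
dt[, last_ride_diff := ifelse(is.na(prev_drop),
                              720 + req_min - req_min[1],
                              req_min - prev_drop),
   by = driver_id]

print(dt)
```

Driver B never completes a ride, so all of its rows take the 720-based branch, and driver A's drop time does not leak into B because every step is grouped by driver_id.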
