R: New column creation using lag values from other columns & many other conditions with data.table
Below is sample data; its last column is the column I want to create.
data<-structure(list(engagement_date = structure(c(16939, 16939, 16939,
16939, 16939, 16939, 16939, 16939), class = "Date"), driver_id = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "69", class = "factor"),
session_id = structure(1:8, .Label = c("16525506", "16526272",
"16527063", "16531156", "16532064", "16533490", "16541432",
"16547653", "16548040", "16553477", "16558000"), class = "factor"),
status = structure(c(3L, 2L, 3L, 4L, 1L, 3L, 1L, 2L), .Label = c("3",
"4", "6", "7"), class = "factor"), req_made_time = structure(c(1463556140,
1463556681, 1463557268, 1463560083, 1463560796, 1463562026,
1463568316, 1463572256), class = c("POSIXct", "POSIXt"), tzone = ""),
ride_drop_time = structure(c(NA, NA, NA, NA, 1463561749,
NA, 1463569532, NA), class = c("POSIXct", "POSIXt"), tzone = ""),
cmplt_flag = structure(c(1L, 1L, 1L, 1L, 2L, 1L, 2L, 1L), .Label = c("0",
"1"), class = "factor"), req_no = 1:8, last_req_diff = c(0,
9, 9.8, 46.9, 11.9, 20.5, 104.8, 65.7), last_ride_diff = c(720,
729, 738.8, 785.7, 797.6, 4.6, 109.4, 45.4)), .Names = c("engagement_date",
"driver_id", "session_id", "status", "req_made_time", "ride_drop_time",
"cmplt_flag", "req_no", "last_req_diff", "last_ride_diff"), row.names = c(NA,
8L), class = "data.frame")
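The last column in the sample data set is the desired column; I produced it with an Excel formula on a subset. I can also get the last column with the code below, but because the data is huge, this takes a very long time. The code should help you understand the various conditions.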
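# data1 is the full data set, with the same columns as the sample `data` above
data1<-as.data.frame(data1)
len<-length(data1$driver_id)+1
seq<-1
while (seq<len){
# first request: 720; previous ride completed: minutes since its drop; otherwise accumulate
data1$last_ride_diff[seq]<-ifelse(data1$req_no[seq]>1,ifelse(data1$cmplt_flag[(seq-1)]==1,as.numeric(difftime(data1$req_made_time[seq],data1$ride_drop_time[(seq-1)],units="mins")),data1$last_ride_diff[(seq-1)]+data1$last_req_diff[seq]),720)
seq<-seq+1}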
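Please suggest a faster way to obtain the desired values, possibly using data.table or any other alternative. Since the data set contains many driver_ids, I need the result computed for each driver_id.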
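Here is one possible approach. It relies on the useful na.locf function, which requires package zoo (it fills each NA with the most recent non-NA value, in either direction).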
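As a small illustration of what na.locf does, here is a sketch on a made-up numeric vector (the vector `x` is hypothetical and not part of the question's data). It assumes zoo is installed and shows both fill directions.

```r
library(zoo)

# Hypothetical input with gaps
x <- c(NA, 5, NA, NA, 8, NA)

# Carry the last observation forward; na.rm = FALSE keeps the leading NA,
# so the result has the same length as the input.
forward <- na.locf(x, na.rm = FALSE)

# fromLast = TRUE fills in the other direction
# (next observation carried backward).
backward <- na.locf(x, na.rm = FALSE, fromLast = TRUE)

print(forward)   # NA  5  5  5  8  8
print(backward)  #  5  5  8  8  8 NA
```

Keeping na.rm = FALSE matters when the result is assigned back as a column, because the default drops leading NAs and shortens the vector.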
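library(data.table)
library(zoo)
# convert to data.table and drop the last (wanted) column
dataT <- as.data.table(data)[, -length(data), with = FALSE]
# carry the most recent drop time forward within each driver
dataT[, drop_time_fill := na.locf(ride_drop_time, na.rm = FALSE), by = driver_id]
dataout <- dataT[,
  .(ride_drop_time,
    drop_time_fill,
    # minutes since the driver's previous request (0 for the first one)
    last_req_dif = ifelse(is.na(shift(req_made_time)), 0,
                          as.numeric(difftime(req_made_time, shift(req_made_time), units = "mins"))),
    # minutes since the most recent earlier drop (NA until the first drop)
    last_ride_diff = difftime(req_made_time, shift(drop_time_fill, 1), units = "mins")
  ),
  by = driver_id]
# before the first drop: start at 720 and accumulate the request gaps
# (this uses the computed last_req_dif; the printed output below matches the rounded
# last_req_diff column of the sample, so a few rows differ slightly in the decimals)
dataout[is.na(last_ride_diff),
        last_ride_diff := 720 + cumsum(last_req_dif), by = driver_id]
dataout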
driver_id ride_drop_time drop_time_fill last_req_dif last_ride_diff
1: 69 <NA> <NA> 0.000000 720.000000 mins
2: 69 <NA> <NA> 9.016667 729.000000 mins
3: 69 <NA> <NA> 9.783333 738.800000 mins
4: 69 <NA> <NA> 46.916667 785.700000 mins
5: 69 2016-05-18 10:55:49 2016-05-18 10:55:49 11.883333 797.600000 mins
6: 69 <NA> 2016-05-18 10:55:49 20.500000 4.616667 mins
7: 69 2016-05-18 13:05:32 2016-05-18 13:05:32 104.833333 109.450000 mins
8: 69 <NA> 2016-05-18 13:05:32 65.666667 45.400000 mins
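It took me a while to figure out, but it was an interesting problem. Note that this assumes there is only one record with last_req_dif = 0 (the first one, I guess).
I could not test this against your full data.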
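Since the approach could not be tested on the full data, one way to gain confidence is to compare a vectorised rule against the row-by-row rule on a small table. The sketch below is not the answer's code. It uses a hypothetical toy table (`toy`, with times already expressed in minutes), hypothetical helper names (`loop_rule`, `vec_rule`), and assumes that a non-NA drop time is equivalent to cmplt_flag == 1, as it is in the sample. It also swaps zoo's na.locf for data.table's nafill, which needs data.table 1.12.4 or later.

```r
library(data.table)

# Hypothetical toy data: two drivers, times in minutes since an arbitrary origin.
# drop_min is NA when the request did not end in a completed ride.
toy <- data.table(
  driver_id = c("a", "a", "a", "a", "b", "b", "b"),
  req_min   = c(0, 9, 30, 50, 5, 20, 45),
  drop_min  = c(NA, 25, NA, NA, NA, NA, 40)
)

# Reference: the row-by-row rule from the question, for one driver.
loop_rule <- function(req, drop) {
  out <- numeric(length(req))
  for (i in seq_along(req)) {
    if (i == 1) {
      out[i] <- 720                                   # first request
    } else if (!is.na(drop[i - 1])) {
      out[i] <- req[i] - drop[i - 1]                  # previous ride completed
    } else {
      out[i] <- out[i - 1] + (req[i] - req[i - 1])    # keep accumulating
    }
  }
  out
}

# Vectorised: minutes since the most recent earlier drop, or
# 720 plus minutes since the driver's first request if there is none yet.
vec_rule <- function(req, drop) {
  prev_drop <- shift(nafill(drop, type = "locf"))
  fifelse(is.na(prev_drop), 720 + (req - req[1]), req - prev_drop)
}

toy[, loop_diff := loop_rule(req_min, drop_min), by = driver_id]
toy[, vec_diff  := vec_rule(req_min, drop_min),  by = driver_id]
print(toy)

# Both rules should agree on every row
stopifnot(isTRUE(all.equal(toy$loop_diff, toy$vec_diff)))
```

The vectorised form works because unrolling the accumulation after a completed ride always collapses to "request time minus the last drop time", so no running sum is needed once a drop exists for the driver.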