[英]reshaping prediction data efficiently using data.table in R
我正在尋找一種更有效的方法來重塑R中的data.table數據。
目前,我正在循環以執行多個時間序列預測的重新形成。
我得到了正確答案,但我覺得這種方法非常不優雅/(un-data.table)。 因此,我期待SO社區看看是否有更優雅的解決方案。
請參閱下面的數據設置以及兩次嘗試獲得所需答案。
# load libraries
require(data.table)
require(lubridate)
# set up data assumptions
id_vec <- letters
len_id_vec <- length(id_vec)
num_orig_dates <- 7
set.seed(123)
# create original data frame
orig <- data.table(ID=rep(id_vec,each=num_orig_dates),
date=rep(c(Sys.Date() %m+% months(0: (num_orig_dates-1))),times=len_id_vec),
most_recent_bal=unlist(lapply(round(runif(len_id_vec)*100),function(y){
y*cumprod(1+rnorm(num_orig_dates,0.001,0.002))})))
# add 24 months ahead predictions of balances using a random walk from the original dates
nrow_orig <- nrow(orig)
for(i in seq(24)){
orig[,paste0('pred',i,'_bal'):=most_recent_bal*(1+rnorm(nrow_orig,0.001,0.003))]
orig[,paste0('pred',i,'_date'):=date %m+% months(i)]
}
# First attempt
t0 <- Sys.time()
tmp1 <- rbindlist(lapply(unique(orig$ID),function(x){
orig1 <- orig[ID==x,]
bal_cols <- c('most_recent_bal',paste0('pred',seq(24),'_bal'))
date_cols <- c('date',paste0('pred',seq(24),'_date'))
# Go through each original date to realign predicted date and predicted balance
date_vec <- orig1$date
tmp <- rbindlist(lapply(date_vec,function(y){
tmp <- data.table(dates=as.Date(as.vector(t(orig1[date==y,date_cols,with=FALSE]))),
bals=as.vector(t(orig1[date==y,bal_cols,with=FALSE])))
tmp[,type:='prediction']
tmp[,date_prediction_run:=y]
# collect historical information too for plotting perposes.
tmp1 <- orig1[date<=y,c('date','most_recent_bal'),with=FALSE]
if(nrow(tmp1)!=0){
setnames(tmp1,c('date','most_recent_bal'),c('dates','bals'))
tmp1[,type:='history']
tmp1[,date_prediction_run:=y]
tmp <- rbind(tmp,tmp1)
}
tmp
}))
tmp[,ID:=x]
}))
t1 <- Sys.time()
t1-t0 #Time difference of 1.117216 secs
# Second Attempt: a slightly more data.table way which is faster but still very inelegant....
t2 <- Sys.time()
bal_cols <- c('most_recent_bal',paste0('pred',seq(24),'_bal'))
date_cols <- c('date',paste0('pred',seq(24),'_date'))
tmp1a <- rbindlist(lapply(unique(orig$ID),function(x){
tmp <- cbind(melt(orig[ID==x,c('date',bal_cols),with=FALSE],
measure.vars = bal_cols,value.name='bals')[,-('variable'),with=FALSE],
data.table(dates=melt(orig[ID==x,date_cols,with=FALSE],
measure.vars = date_cols)[,value]))
setnames(tmp,'date','date_prediction_run')
tmp[,type:='prediction']
tmp1 <- orig[ID==x,orig[ID==x & date<=.BY,c('date','most_recent_bal'),with=FALSE],by=date]
setnames(tmp1,c('date_prediction_run','dates','bals'))
tmp1[,type:='history']
setcolorder(tmp1,colnames(tmp1)[match(colnames(tmp),colnames(tmp1))])
tmp <- rbind(tmp,tmp1)
tmp[,ID:=x]
tmp
}))
t3 <- Sys.time()
t3-t2 # Time difference of 0.2309799 secs
data.table
解決方案 即打破我們正在做的事情,只是為了第一次通過,可讀。 NB接下來,在下面,(稍后會更新)我將通過將所有內容拉到一起來優化解決方案,例如通過組合步驟,鏈接,就地分配等。更優化的解決方案將如您所料,如果不理解這里首先介紹的步驟,目的是向人們展示學習數據,那么可讀性就會低得多。可以說明他們如何找到解決方案。
# First Pass = Step-by-step (not optimized) just first work out a solution
library(data.table)
# Transform prediction data from `orig` data.table into long format
# i.e. by melting pred#_bal and pred#_date columns
pred_data <-
data.table::melt( orig,
measure = patterns("pred[0-9]+_bal", "pred[0-9]+_date"),
value.name = c("bals", "date_prediction_run"))
pred_data[, type := "prediction"] # add the 'type' column to pred_data (all are type="prediction")
# select desired columns in order
pred_data <- pred_data[, .( dates=date, bals, type, date_prediction_run, ID)]
# Collect historical information from the most_recent_bal column,
# which the OP wants for plotting purposes
graph_data <-
orig[ orig,
.(ID, dates=date, bals=most_recent_bal, date_prediction_run=x.date),
on=.(ID, date>=date)]
graph_data[, type := "history"] # these are all type="history"
# final output, combining the prediction data and the graph data:
output <- rbindlist(list(pred_data, graph_data), use.names=TRUE)
下面是我的“通過組合一些步驟和鏈接進行優化的第一步”然而,盡管下面我已經結合了一些步驟,使用鏈接並且它看起來很好而且簡短, 下面的代碼並不比原始的逐步解決方案快以上我將在帖子的最后顯示基准時間。 我將離開下面的代碼,因為它說明了一個好點並提供了一個學習機會。
首先通過組合一些步驟和鏈接來優化[不快!] library(data.table) # Transform prediction data into long format # by melting pred#_bal and pred#_date columns pred_data <- data.table::melt( orig[, type := "prediction"], #add the type column to orig, before melting measure = patterns("pred[0-9]+_bal", "pred[0-9]+_date"), value.name = c("bals", "date_prediction_run") )[, .( dates=date, bals, type, date_prediction_run, ID)] # chain, to select desired columns in order # FINAL RESULT: rbindlist pred_data to historic data pred_data <- rbindlist( list( pred_data, orig[ orig[, type := "history"], .(dates=date, bals=most_recent_bal, type, date_prediction_run=x.date, ID), on=.(ID, date>=date)] ), use.names=TRUE)
使用非常方便的microbenchmark
軟件包測試時間:
Unit: milliseconds expr min lq mean median uq max neval hlm_first_attempt 1140.017957 1190.818176 1249.499493 1248.977454 1299.497679 1427.632140 100 hlm_second_attempt 231.380930 239.513223 254.702865 249.735005 262.516276 375.762675 100 krads_step.by.step 2.855509 2.985509 3.289648 3.059481 3.269429 6.568006 100 krads_optimized 2.909343 3.073837 3.555803 3.150584 3.554100 12.521439 100
基准測試結果表明,data.table解決方案是OP解決方案的巨大時序改進。
太棒了,這就是要求的:我們已經展示了data.table
速度是多么快,但我也希望它也是如此簡單易讀 !
但是,不要錯過這里的另一課:
查看微基准測試結果,請注意我的解決方案如何實際上是平均時間相同。 起初可能沒有意義:為什么我的“循序漸進”解決方案有這么多代碼行的效率和我嘗試的“優化”解決方案一樣快?
答:如果仔細觀察,我的解決方案中會出現所有相同的步驟。 在我的“優化”解決方案中,是的,我們正在進行鏈接,你可能一開始認為執行的任務比“逐步”字樣的更少。 但是,正如基准測試結果應該告訴你我們沒有完成更少的任務! 即在我們使用[]
將另一個操作“鏈接”在一起的每個點,它實際上相當於用<-
分配回原始DT。
如果你可以解決這個問題,那么你將會更好地編程:你可以自信地跳過“鏈接”的步驟,而是使用<-
來逐步說明(更易讀,更容易調試)更可維護的解決方案!
在可以節省時間的地方,它歸結為在循環或應用操作中找不到多次不必要地分配的地方。 但這是我認為的另一篇文章的主題!
注意如果你想在你自己的代碼上使用microbenchmark
,我所做的就是:
library(microbenchmark) mbm <- microbenchmark( hlm_first_attempt = { # Pasted in hlm's first solution, here }, hlm_second_attempt = { # Pasted in hlm's second solution, here }, krads_step.by.step = { # Pasted in my first solution, here }, krads_optimized = { # Pasted in my second solution, here }, times = 100L ) mbm
如果您想要圖表,請按以下步驟操作:
library(ggplot2) autoplot(mbm)
我嘗試使用dplyr
和reshape2
進行此操作,我覺得它稍微優雅一點(在技術上 apply
用於循環 )。 它還可以減少大約0.04秒的運行時間。
t0 = Sys.time()
# Extract predicted values in long form
trial_bal = reshape2::melt(orig, id.vars = c("ID", "date"), measure.vars =
c(colnames(orig)[grep("pred[0-9]{1,}_bal", colnames(orig))]))
colnames(trial_bal) = c("ID", "date_prediction_run", "type", "balance")
trial_bal$type = gsub("_bal", "", trial_bal$type)
trial_date = reshape2::melt(orig, id.vars = c("ID", "date"), measure.vars =
c(colnames(orig)[grep("pred[0-9]{1,}_date", colnames(orig))]))
colnames(trial_date) = c("ID", "date_prediction_run", "type", "dates")
trial_date$type = gsub("_date", "", trial_date$type)
trial = merge.data.frame(trial_date, trial_bal, by = c("ID", "date_prediction_run", "type"))
trial$type = "prediction"
trial = trial %>% select(dates, balance, type, date_prediction_run, ID)
# Extract historical values in long form
temp = orig[, c("ID", "date", "most_recent_bal")]
temp = merge(temp[, c("ID", "date")], temp, by = "ID", allow.cartesian = TRUE)
temp = temp[temp$date.x >= temp$date.y, ]
temp$type = "history"
temp = temp %>% select(dates = date.y, balance = most_recent_bal, type,
date_prediction_run = date.x, ID)
# Combine prediction and history
trial = rbind(trial, temp)
trial = trial %>% arrange(ID, date_prediction_run, desc(type), dates)
t1 = Sys.time()
t1 - t0 #Time difference of 0.1900001 secs
這比您擁有的行數少182行,因為您有兩次dates = date_prediction_run
- 一個在type
prediction
,一個在history
下。
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.