Reshaping prediction data efficiently using data.table in R
I am looking for a more efficient way of reshaping data.table data in R.
At the moment I am looping through to carry out a reshape of multiple time series predictions.
I get the correct answer that I am after, however I feel that the methodology is very inelegant/(un-data.table). Therefore I am looking to the SO community to see if there is a more elegant solution.
Please see below for the data setup and also two attempts at getting to the desired answer.
# load libraries
require(data.table)
require(lubridate)
# set up data assumptions
id_vec <- letters
len_id_vec <- length(id_vec)
num_orig_dates <- 7
set.seed(123)
# create original data frame
orig <- data.table(ID=rep(id_vec,each=num_orig_dates),
                   date=rep(c(Sys.Date() %m+% months(0:(num_orig_dates-1))),times=len_id_vec),
                   most_recent_bal=unlist(lapply(round(runif(len_id_vec)*100),function(y){
                     y*cumprod(1+rnorm(num_orig_dates,0.001,0.002))})))
# add 24 months ahead predictions of balances using a random walk from the original dates
nrow_orig <- nrow(orig)
for(i in seq(24)){
  orig[,paste0('pred',i,'_bal'):=most_recent_bal*(1+rnorm(nrow_orig,0.001,0.003))]
  orig[,paste0('pred',i,'_date'):=date %m+% months(i)]
}
# First attempt
t0 <- Sys.time()
tmp1 <- rbindlist(lapply(unique(orig$ID),function(x){
  orig1 <- orig[ID==x,]
  bal_cols <- c('most_recent_bal',paste0('pred',seq(24),'_bal'))
  date_cols <- c('date',paste0('pred',seq(24),'_date'))
  # Go through each original date to realign predicted date and predicted balance
  date_vec <- orig1$date
  tmp <- rbindlist(lapply(date_vec,function(y){
    tmp <- data.table(dates=as.Date(as.vector(t(orig1[date==y,date_cols,with=FALSE]))),
                      bals=as.vector(t(orig1[date==y,bal_cols,with=FALSE])))
    tmp[,type:='prediction']
    tmp[,date_prediction_run:=y]
    # collect historical information too for plotting purposes
    tmp1 <- orig1[date<=y,c('date','most_recent_bal'),with=FALSE]
    if(nrow(tmp1)!=0){
      setnames(tmp1,c('date','most_recent_bal'),c('dates','bals'))
      tmp1[,type:='history']
      tmp1[,date_prediction_run:=y]
      tmp <- rbind(tmp,tmp1)
    }
    tmp
  }))
  tmp[,ID:=x]
}))
t1 <- Sys.time()
t1-t0 #Time difference of 1.117216 secs
# Second Attempt: a slightly more data.table way which is faster but still very inelegant....
t2 <- Sys.time()
bal_cols <- c('most_recent_bal',paste0('pred',seq(24),'_bal'))
date_cols <- c('date',paste0('pred',seq(24),'_date'))
tmp1a <- rbindlist(lapply(unique(orig$ID),function(x){
  tmp <- cbind(melt(orig[ID==x,c('date',bal_cols),with=FALSE],
                    measure.vars = bal_cols,value.name='bals')[,-('variable'),with=FALSE],
               data.table(dates=melt(orig[ID==x,date_cols,with=FALSE],
                                     measure.vars = date_cols)[,value]))
  setnames(tmp,'date','date_prediction_run')
  tmp[,type:='prediction']
  tmp1 <- orig[ID==x,orig[ID==x & date<=.BY,c('date','most_recent_bal'),with=FALSE],by=date]
  setnames(tmp1,c('date_prediction_run','dates','bals'))
  tmp1[,type:='history']
  setcolorder(tmp1,colnames(tmp1)[match(colnames(tmp),colnames(tmp1))])
  tmp <- rbind(tmp,tmp1)
  tmp[,ID:=x]
  tmp
}))
t3 <- Sys.time()
t3-t2 # Time difference of 0.2309799 secs
Here is a data.table solution, presented step-by-step - ie to break down what we are doing and, just for this first pass, be readable. NB Afterwards, below (in an update to follow shortly), I'll optimize the solution somewhat by pulling everything together, eg by combining steps, chaining, in-place assignments, etc. The more optimized solution will, as you might expect, be far less readable without first understanding the step-by-step presented here, which aims to show people learning data.table how they might arrive at a solution.
# First Pass = Step-by-step (not optimized) just first work out a solution
library(data.table)
# Transform prediction data from `orig` data.table into long format
# i.e. by melting pred#_bal and pred#_date columns
pred_data <-
  data.table::melt( orig,
                    measure = patterns("pred[0-9]+_bal", "pred[0-9]+_date"),
                    value.name = c("bals", "date_prediction_run"))
pred_data[, type := "prediction"] # add the 'type' column to pred_data (all are type="prediction")
# select desired columns in order
pred_data <- pred_data[, .( dates=date, bals, type, date_prediction_run, ID)]
# Collect historical information from the most_recent_bal column,
# which the OP wants for plotting purposes
graph_data <-
  orig[ orig,
        .(ID, dates=date, bals=most_recent_bal, date_prediction_run=x.date),
        on=.(ID, date>=date)]
graph_data[, type := "history"] # these are all type="history"
# final output, combining the prediction data and the graph data:
output <- rbindlist(list(pred_data, graph_data), use.names=TRUE)
Below is my "first pass at optimizing by combining some steps and chaining". However, even though I have combined some steps and used chaining, and it looks nice and short, the code below is no faster than the original step-by-step solution above, as I'll show at the end of the post with benchmark timings. I'm leaving the code below as it illustrates a good point and presents a learning opportunity.
library(data.table)
# Transform prediction data into long format
# by melting pred#_bal and pred#_date columns
pred_data <-
  data.table::melt( orig[, type := "prediction"],  # add the type column to orig, before melting
                    measure = patterns("pred[0-9]+_bal", "pred[0-9]+_date"),
                    value.name = c("bals", "date_prediction_run")
  )[, .( dates=date, bals, type, date_prediction_run, ID)]  # chain, to select desired columns in order

# FINAL RESULT: rbindlist pred_data to historic data
pred_data <-
  rbindlist( list( pred_data,
                   orig[ orig[, type := "history"],
                         .(dates=date, bals=most_recent_bal, type,
                           date_prediction_run=x.date, ID),
                         on=.(ID, date>=date)]
  ), use.names=TRUE)
Testing timings using the very handy microbenchmark package:
Unit: milliseconds
               expr         min          lq        mean      median          uq         max neval
  hlm_first_attempt 1140.017957 1190.818176 1249.499493 1248.977454 1299.497679 1427.632140   100
 hlm_second_attempt  231.380930  239.513223  254.702865  249.735005  262.516276  375.762675   100
 krads_step.by.step    2.855509    2.985509    3.289648    3.059481    3.269429    6.568006   100
    krads_optimized    2.909343    3.073837    3.555803    3.150584    3.554100   12.521439   100
The benchmark results show the data.table solutions are huge timing improvements over the OP's solutions. We've shown not only how fast data.table can be, but I hope also how simple & readable it can be too!
However, don't miss another lesson here:
Looking at the microbenchmark results, note how both my solutions have effectively the same mean time. At first that might not make sense: Why is my "step-by-step" solution, with so many more lines of code, effectively just as fast as my attempted "optimized" solution?
Answer: If you look closely, all the same steps appear in both my solutions. In my "optimized" solution, yes, we're chaining, and you might at first think we're doing fewer assignments than the "step-by-step" version literally spells out. But, as the benchmark results should tell you, we have NOT done fewer assignments! Ie at each point where we use [] to "chain" together another operation, it is literally equivalent to assigning back to your original DT with <-.
If you can wrap your head around that, you'll be on your way to better programming: you can confidently skip the step of "chaining" and instead use <- to spell out a step-by-step (more readable, easier to debug and more maintainable) solution!
Where you can save time, it comes down to finding places not to assign multiple times unnecessarily in a loop or apply operation. But that's a topic for another post, I think!
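As a minimal sketch of that equivalence (using a toy table, not the OP's data), the chained form and the spelled-out <- form below do the same work and produce identical results:

```r
library(data.table)

DT <- data.table(x = 1:5)

# "Chained" form: each [] operates on the result of the previous one
res_chained <- DT[, y := x * 2][, .(x, y, z = y + 1)]

# Step-by-step form with <-: the same operations, spelled out
res_steps <- copy(DT)
res_steps[, y := x * 2]
res_steps <- res_steps[, .(x, y, z = y + 1)]

identical(res_chained, res_steps)  # TRUE
```

The chained version saves keystrokes, not work: each [] still materializes an intermediate result, exactly as the explicit assignments do.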
NB In case you want to use microbenchmark on your own code, all I did was this:
library(microbenchmark)
mbm <- microbenchmark(
  hlm_first_attempt  = { # Pasted in hlm's first solution, here
  },
  hlm_second_attempt = { # Pasted in hlm's second solution, here
  },
  krads_step.by.step = { # Pasted in my first solution, here
  },
  krads_optimized    = { # Pasted in my second solution, here
  },
  times = 100L
)
mbm
If you want a graph, follow with:
library(ggplot2)
autoplot(mbm)
I tried this using dplyr and reshape2, and I feel it is slightly more elegant (no apply, which is technically a for loop). It also shaves off about 0.04 secs of runtime.
t0 = Sys.time()
# Extract predicted values in long form
trial_bal = reshape2::melt(orig, id.vars = c("ID", "date"), measure.vars =
c(colnames(orig)[grep("pred[0-9]{1,}_bal", colnames(orig))]))
colnames(trial_bal) = c("ID", "date_prediction_run", "type", "balance")
trial_bal$type = gsub("_bal", "", trial_bal$type)
trial_date = reshape2::melt(orig, id.vars = c("ID", "date"), measure.vars =
c(colnames(orig)[grep("pred[0-9]{1,}_date", colnames(orig))]))
colnames(trial_date) = c("ID", "date_prediction_run", "type", "dates")
trial_date$type = gsub("_date", "", trial_date$type)
trial = merge.data.frame(trial_date, trial_bal, by = c("ID", "date_prediction_run", "type"))
trial$type = "prediction"
trial = trial %>% select(dates, balance, type, date_prediction_run, ID)
# Extract historical values in long form
temp = orig[, c("ID", "date", "most_recent_bal")]
temp = merge(temp[, c("ID", "date")], temp, by = "ID", allow.cartesian = TRUE)
temp = temp[temp$date.x >= temp$date.y, ]
temp$type = "history"
temp = temp %>% select(dates = date.y, balance = most_recent_bal, type,
date_prediction_run = date.x, ID)
# Combine prediction and history
trial = rbind(trial, temp)
trial = trial %>% arrange(ID, date_prediction_run, desc(type), dates)
t1 = Sys.time()
t1 - t0 #Time difference of 0.1900001 secs
This has 182 rows fewer than the number of rows you have, because you have dates = date_prediction_run twice - once under type prediction and once under history.
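If you do want those 182 rows so the output lines up with the data.table answers, one option (a sketch building on the trial and orig objects above) is to also emit each original date/balance as a prediction row before the final sort:

```r
library(dplyr)

# Hypothetical patch: add each original date/balance as a
# type = "prediction" row, as the OP's solutions do
current <- orig %>%
  transmute(dates = date, balance = most_recent_bal,
            type = "prediction", date_prediction_run = date, ID)

trial <- rbind(trial, current) %>%
  arrange(ID, date_prediction_run, desc(type), dates)
```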