
Reshaping prediction data efficiently using data.table in R

I am looking for a more efficient way of reshaping data.table data in R.

At the moment I am looping through the data to reshape multiple time series predictions.

I get the correct answer that I am after, but I feel that the methodology is very inelegant (un-data.table). Therefore I am looking to the SO community to see if there is a more elegant solution.

Please see below for the data setup and also two attempts at getting to the desired answer.

# load libraries
require(data.table)
require(lubridate)


# set up data assumptions
id_vec <- letters
len_id_vec <- length(id_vec)
num_orig_dates <- 7
set.seed(123)


# create original data frame
orig <- data.table(ID=rep(id_vec,each=num_orig_dates),
                   date=rep(c(Sys.Date() %m+% months(0: (num_orig_dates-1))),times=len_id_vec),
                   most_recent_bal=unlist(lapply(round(runif(len_id_vec)*100),function(y){
                     y*cumprod(1+rnorm(num_orig_dates,0.001,0.002))})))


# add 24 months ahead predictions of balances using a random walk from the original dates
nrow_orig <- nrow(orig)

for(i in seq(24)){
  orig[,paste0('pred',i,'_bal'):=most_recent_bal*(1+rnorm(nrow_orig,0.001,0.003))]
  orig[,paste0('pred',i,'_date'):=date %m+% months(i)]
}


# First attempt
t0 <- Sys.time()
tmp1 <- rbindlist(lapply(unique(orig$ID),function(x){
  orig1 <- orig[ID==x,]

  bal_cols <- c('most_recent_bal',paste0('pred',seq(24),'_bal'))
  date_cols <- c('date',paste0('pred',seq(24),'_date'))

  # Go through each original date to realign predicted date and predicted balance
  date_vec <- orig1$date
  tmp <- rbindlist(lapply(date_vec,function(y){

    tmp <- data.table(dates=as.Date(as.vector(t(orig1[date==y,date_cols,with=FALSE]))),
                      bals=as.vector(t(orig1[date==y,bal_cols,with=FALSE])))
    tmp[,type:='prediction']
    tmp[,date_prediction_run:=y]

    # collect historical information too, for plotting purposes
    tmp1 <- orig1[date<=y,c('date','most_recent_bal'),with=FALSE]
    if(nrow(tmp1)!=0){

      setnames(tmp1,c('date','most_recent_bal'),c('dates','bals'))
      tmp1[,type:='history']
      tmp1[,date_prediction_run:=y]

      tmp <- rbind(tmp,tmp1)

    }

    tmp
  }))
  tmp[,ID:=x]
}))
t1 <- Sys.time()
t1-t0 #Time difference of 1.117216 secs

# Second Attempt: a slightly more data.table way which is faster but still very inelegant....
t2 <- Sys.time()
bal_cols <- c('most_recent_bal',paste0('pred',seq(24),'_bal'))
date_cols <- c('date',paste0('pred',seq(24),'_date'))
tmp1a <- rbindlist(lapply(unique(orig$ID),function(x){
  tmp <- cbind(melt(orig[ID==x,c('date',bal_cols),with=FALSE],
                    measure.vars = bal_cols,value.name='bals')[,-('variable'),with=FALSE],
               data.table(dates=melt(orig[ID==x,date_cols,with=FALSE],
                                     measure.vars = date_cols)[,value]))
  setnames(tmp,'date','date_prediction_run')
  tmp[,type:='prediction']

  tmp1 <- orig[ID==x,orig[ID==x & date<=.BY,c('date','most_recent_bal'),with=FALSE],by=date]
  setnames(tmp1,c('date_prediction_run','dates','bals'))
  tmp1[,type:='history']
  setcolorder(tmp1,colnames(tmp1)[match(colnames(tmp),colnames(tmp1))])
  tmp <- rbind(tmp,tmp1)
  tmp[,ID:=x]
  tmp
}))
t3 <- Sys.time()
t3-t2 # Time difference of 0.2309799 secs

Using data.table per the OP's request.

First, here is how to build a data.table solution step by step,

i.e. breaking down what we are doing so that, just for this first pass, it stays readable. NB: Afterwards, below (in an update to follow shortly), I'll optimize the solution somewhat by pulling everything together, e.g. by combining steps, chaining, in-place assignments, etc. The more optimized solution will, as you might expect, be far less readable unless you first understand the step-by-step version presented here, whose aim is to show people learning data.table how they might arrive at a solution.

# First Pass = Step-by-step (not optimized) just first work out a solution 

library(data.table)

# Transform prediction data from `orig` data.table into long format
# i.e. by melting pred#_bal and pred#_date columns
pred_data <- 
  data.table::melt( orig, 
                    measure = patterns("pred[0-9]+_bal", "pred[0-9]+_date"),  
                    value.name =     c("bals",           "date_prediction_run"))

pred_data[, type := "prediction"]  # add the 'type' column to pred_data (all are type="prediction")

# select desired columns in order
pred_data <- pred_data[, .( dates=date, bals, type, date_prediction_run, ID)] 


# Collect historical information from the most_recent_bal column, 
# which the OP wants for plotting purposes

graph_data <- 
  orig[ orig, 
        .(ID, dates=date, bals=most_recent_bal, date_prediction_run=x.date),
        on=.(ID, date>=date)]

graph_data[, type := "history"]  # these are all type="history" 

# final output, combining the prediction data and the graph data:
output <- rbindlist(list(pred_data, graph_data), use.names=TRUE)
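The non-equi self-join that builds `graph_data` can be hard to parse at first. Here is a minimal sketch of the same idea on a toy table (the data are invented for illustration, and I use explicit `x.`/`i.` prefixes so there is no ambiguity about which side each column comes from):

```r
library(data.table)

# toy history: one ID, three monthly balances (invented for illustration)
dt <- data.table(ID   = "a",
                 date = as.Date(c("2020-01-01", "2020-02-01", "2020-03-01")),
                 bal  = c(100, 101, 102))

# for each i row (a prediction run), pull every x row whose date is at or
# before the run date; x.* are the history values, i.date is the run date
hist <- dt[dt,
           .(ID, dates = x.date, bals = x.bal, date_prediction_run = i.date),
           on = .(ID, date <= date)]

hist[date_prediction_run == as.Date("2020-03-01")]
#    ID      dates bals date_prediction_run
# 1:  a 2020-01-01  100          2020-03-01
# 2:  a 2020-02-01  101          2020-03-01
# 3:  a 2020-03-01  102          2020-03-01
```

Each run date thus gets its own copy of all history up to and including that date, which is exactly the shape wanted for plotting.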

UPDATE 3 = IMPORTANT NOTE: The code below does nothing to improve speed!

Below is my "first pass at optimizing by combining some steps and chaining". However, even though I have combined some steps and used chaining, and it looks nice and short, the code below is no faster than the original step-by-step solution above, as I'll show at the end of the post with benchmark timings. I'm leaving the code in as it illustrates a good point and presents a learning opportunity.

First pass at optimizing by combining some steps and chaining [not faster!]
 library(data.table)
 
 # Transform prediction data into long format
 # by melting pred#_bal and pred#_date columns
 pred_data <-
   data.table::melt( orig[, type := "prediction"],  # add the type column to orig, before melting
                     measure = patterns("pred[0-9]+_bal", "pred[0-9]+_date"),
                     value.name = c("bals", "date_prediction_run")
   )[, .( dates=date, bals, type, date_prediction_run, ID)]  # chain, to select desired columns in order
 
 # FINAL RESULT: rbindlist pred_data to historic data
 pred_data <-
   rbindlist( list( pred_data,
                    orig[ orig[, type := "history"],
                          .(dates=date, bals=most_recent_bal, type, date_prediction_run=x.date, ID),
                          on=.(ID, date>=date)] ),
              use.names=TRUE)

Continuing UPDATE 3:

Testing timings using the very handy microbenchmark package:

 Unit: milliseconds
                expr         min          lq        mean      median          uq         max neval
   hlm_first_attempt 1140.017957 1190.818176 1249.499493 1248.977454 1299.497679 1427.632140   100
  hlm_second_attempt  231.380930  239.513223  254.702865  249.735005  262.516276  375.762675   100
  krads_step.by.step    2.855509    2.985509    3.289648    3.059481    3.269429    6.568006   100
     krads_optimized    2.909343    3.073837    3.555803    3.150584    3.554100   12.521439   100
The benchmark results show the data.table solutions are huge timing improvements over the OP's solutions. Great, that's what was asked for: we've shown how awesomely fast data.table can be, but I hope also how simple and readable it can be too! However, don't miss another lesson here:

Looking at the microbenchmark results, note how both my solutions have effectively the same mean time. At first that might not make sense: why is my "step-by-step" solution, with so many more lines of code, effectively just as fast as my attempted "optimized" solution?

Answer: If you look closely, all the same steps appear in both my solutions. In my "optimized" solution, yes, we're chaining, and you might at first think it does fewer assignments than the "step-by-step" version literally spells out. But, as the benchmark results should tell you, we have NOT done fewer assignments! I.e. at each point where we use [] to "chain" together another operation, it is literally equivalent to assigning back to your original DT with <-.

If you can wrap your head around that, you'll be on your way to better programming: you can confidently skip the step of "chaining" and instead use <- to spell out a step-by-step (more readable, easier to debug, and more maintainable) solution!
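To make that equivalence concrete, here is a toy sketch (invented data, not from the question): chaining with [] and spelling the steps out with <- do exactly the same work and give the same result.

```r
library(data.table)

dt <- data.table(x = 1:5)

# chained: each [] still materializes and returns a full intermediate result
res_chained <- dt[x > 2][, y := x * 2L][]

# step-by-step: the same two operations, written out with an assignment
res_steps <- dt[x > 2]
res_steps[, y := x * 2L]

# same content either way; neither version did less work
stopifnot(isTRUE(all.equal(res_chained, res_steps)))
```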

Where you can save time, it comes down to finding places where you are assigning multiple times unnecessarily inside a loop or apply operation. But that's a topic for another post, I think!
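As one small illustration of that point (a sketch I'm adding here, not part of the original answers): assigning with [ inside a tight loop pays the overhead of the full [.data.table machinery on every iteration, whereas set() does the same in-place update with minimal per-call overhead.

```r
library(data.table)

n <- 1000L
dt1 <- data.table(v = seq_len(n))
dt2 <- data.table(v = seq_len(n))

# := via [ inside a loop: the overhead of [ is paid n times
system.time(for (i in seq_len(n)) dt1[i, v := v * 2L])

# set() performs the same row-wise update with far less per-call overhead
system.time(for (i in seq_len(n)) set(dt2, i, "v", dt2$v[i] * 2L))

# both loops double every row exactly once, so the results match
stopifnot(identical(dt1$v, dt2$v))
```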

NB: In case you want to use microbenchmark on your own code, all I did was this:

 library(microbenchmark)
 
 mbm <- microbenchmark(
   hlm_first_attempt  = {
     # Pasted in hlm's first solution, here
   },
   hlm_second_attempt = {
     # Pasted in hlm's second solution, here
   },
   krads_step.by.step = {
     # Pasted in my first solution, here
   },
   krads_optimized    = {
     # Pasted in my second solution, here
   },
   times = 100L
 )
 mbm

If you want a graph, follow with:

 library(ggplot2)
 autoplot(mbm)

I tried this using dplyr and reshape2, and I feel it is slightly more elegant (no apply, which is technically a for loop). It also shaves off about 0.04 secs of runtime.

library(dplyr)  # needed for %>%, select() and arrange() below

t0 = Sys.time()

# Extract predicted values in long form
trial_bal = reshape2::melt(orig, id.vars = c("ID", "date"), measure.vars = 
c(colnames(orig)[grep("pred[0-9]{1,}_bal", colnames(orig))]))
colnames(trial_bal) = c("ID", "date_prediction_run", "type", "balance")
trial_bal$type = gsub("_bal", "", trial_bal$type)

trial_date = reshape2::melt(orig, id.vars = c("ID", "date"), measure.vars = 
c(colnames(orig)[grep("pred[0-9]{1,}_date", colnames(orig))]))
colnames(trial_date) = c("ID", "date_prediction_run", "type", "dates")
trial_date$type = gsub("_date", "", trial_date$type)

trial = merge.data.frame(trial_date, trial_bal, by = c("ID", "date_prediction_run", "type"))
trial$type = "prediction"
trial = trial %>% select(dates, balance, type, date_prediction_run, ID)

# Extract historical values in long form
temp = orig[, c("ID", "date", "most_recent_bal")]
temp = merge(temp[, c("ID", "date")], temp, by = "ID", allow.cartesian = TRUE)
temp = temp[temp$date.x >= temp$date.y, ]
temp$type = "history"
temp = temp %>% select(dates = date.y, balance = most_recent_bal, type, 
date_prediction_run = date.x, ID)

# Combine prediction and history
trial = rbind(trial, temp)
trial = trial %>% arrange(ID, date_prediction_run, desc(type), dates)

t1 = Sys.time()
t1 - t0 #Time difference of 0.1900001 secs

This has 182 fewer rows than your output because you have dates = date_prediction_run twice - once under type prediction and once under history.
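If an exact row-for-row match with the data.table output matters, one possible fix-up (a sketch reusing the `orig` and `trial` objects built above, so this fragment is not self-contained) is to append those run-date rows back under type "prediction" before the final arrange:

```r
# hypothetical reconciliation: re-add each run date's own balance as a
# "prediction" row (26 IDs * 7 dates = 182 rows), mirroring the
# data.table answers that carry it under both types
extra <- data.frame(dates = orig$date,
                    balance = orig$most_recent_bal,
                    type = "prediction",
                    date_prediction_run = orig$date,
                    ID = orig$ID,
                    stringsAsFactors = FALSE)
trial <- rbind(trial, extra) %>%
  arrange(ID, date_prediction_run, desc(type), dates)
```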
