How to write a cumulative calculation in data.table

Question

A sequential, cumulative calculation

I need to make a time-series calculation, where the value calculated in each row depends on the result calculated in the previous row. I am hoping to use the convenience of data.table. The actual problem is a hydrological model: a cumulative water balance calculation, adding rainfall at each time step and subtracting runoff and evaporation as a function of the current water volume. The dataset includes different basins and scenarios (groups). Here I will use a simpler illustration of the problem.

A simplified example of the calculation looks like this, for each time step (row) i:

 v[i] <- a[i] + b[i] * v[i-1]

a and b are vectors of parameter values, and v is the result vector. For the first row (i == 1) the previous value v[i-1] is taken to be the initial value v0 = 0.
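
To make the dependency between rows concrete, here is a minimal sketch that unrolls the recurrence by hand for four time steps. The values of a and b are illustrative assumptions (a = 1..4, b = 0.1 throughout), and the names v1 to v4 are only used for this illustration.

```r
# Hand-unrolled recurrence v[i] = a[i] + b[i] * v[i-1], starting from v0 = 0.
a  <- c(1, 2, 3, 4)        # illustrative parameter values
b  <- rep(0.1, 4)
v0 <- 0

v1 <- a[1] + b[1] * v0     # 1 + 0.1 * 0
v2 <- a[2] + b[2] * v1     # 2 + 0.1 * 1
v3 <- a[3] + b[3] * v2     # 3 + 0.1 * 2.1
v4 <- a[4] + b[4] * v3     # 4 + 0.1 * 3.21

print(c(v1, v2, v3, v4))
```

Each step needs the value produced by the step before it, which is why a single vectorized expression over the original columns is not enough.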

First attempt

My first thought was to use shift() in data.table. A minimal example, including the desired result v.ans, is

library(data.table)        # version 1.9.7
DT <- data.table(a = 1:4, 
                 b = 0.1,
                 v.ans = c(1, 2.1, 3.21, 4.321) )
DT
#    a   b v.ans
# 1: 1 0.1 1.000
# 2: 2 0.1 2.100
# 3: 3 0.1 3.210
# 4: 4 0.1 4.321

DT[, v := NA]   # initialize v
DT[, v := a + b * ifelse(is.na(shift(v)), 0, shift(v))][]
#    a   b v.ans v
# 1: 1 0.1 1.000 1
# 2: 2 0.1 2.100 2
# 3: 3 0.1 3.210 3
# 4: 4 0.1 4.321 4

This doesn't work, because shift(v) gives a copy of the original column v, shifted by 1 row. The right-hand side is evaluated in full before the assignment happens, so the copy is unaffected by the assignment to v.
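
The copy semantics can be seen in isolation with a small sketch. The vector values and the name lagged are arbitrary assumptions for illustration; only shift() itself comes from data.table.

```r
library(data.table)

v <- c(10, 20, 30)     # arbitrary illustrative values
lagged <- shift(v)     # lag by one row, computed once from v as it is now
v[1] <- 99             # a later change to v ...
print(lagged)          # ... does not reach the lagged copy
```

Because lagged is a snapshot, a row-by-row update of v can never feed back into it, which is exactly what the recurrence would require.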

I also considered building the equation using cumsum() and cumprod(), but that won't work either.

Brute force approach

So I resort to a for loop, wrapped in a function for convenience:

vcalc <- function(a, b, v0 = 0) {
  v <- rep(NA_real_, length(a))      # initialize v as a numeric vector
  for (i in seq_along(a)) {          # seq_along() is safe for zero-length input
    v[i] <- a[i] + b[i] * ifelse(i == 1, v0, v[i-1])
  }
  return(v)
}

This cumulative function works fine with data.table:

DT[, v := vcalc(a, b, 0)][]
#    a   b v.ans     v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321
identical(DT$v, DT$v.ans)
# [1] TRUE

My question

My question is, can I write this calculation in a more concise and efficient data.table way, without having to use the for loop and/or function definition? Using set() perhaps?

Or is there a better approach altogether?

Edit: A better loop

David's Rcpp solution below inspired me to remove the ifelse() from the for loop:

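vcalc2 <- function(a, b, v0 = 0) {
  v <- rep(NA_real_, length(a))      # initialize v as a numeric vector
  for (i in seq_along(a)) {          # seq_along() is safe for zero-length input
    v0 <- v[i] <- a[i] + b[i] * v0   # v0 carries the previous result forward
  }
  return(v)
}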

vcalc2() is 60% faster than vcalc().
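
Since the real dataset has basins and scenarios as groups, a sketch of how the same loop might be applied per group may be useful. The table DTg and the column name basin are hypothetical, and the loop is repeated here only so the example is self-contained. It assumes rows are already in time order within each group.

```r
library(data.table)

# Same loop as vcalc2(), repeated so the sketch runs on its own.
vcalc2 <- function(a, b, v0 = 0) {
  v <- numeric(length(a))
  for (i in seq_along(a)) {
    v0 <- v[i] <- a[i] + b[i] * v0
  }
  v
}

# Hypothetical data: two basins with four time steps each.
DTg <- data.table(basin = rep(c("A", "B"), each = 4),
                  a     = c(1:4, 4:1),
                  b     = 0.1)

# by = basin restarts the recurrence from v0 = 0 within each group.
DTg[, v := vcalc2(a, b), by = basin]
print(DTg)
```

Grouping by several columns, for example basin and scenario, would work the same way by listing them in by.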

Answer 1

It may not be 100% what you are looking for, as it does not use the "data.table way" and still uses a for loop. However, this approach should be faster (I assume you want to use data.table and the data.table way to speed up your code). I leverage Rcpp to write a short function called HydroFun, which can be used in R like any other function (you just need to source the function first). My gut feeling tells me that the data.table way (if it exists) is pretty complicated, because you cannot compute a closed-form solution (but I may be wrong on this point...).

My approach looks like this:

The Rcpp function looks like this (in the file hydrofun.cpp):

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
NumericVector HydroFun(NumericVector a, NumericVector b, double v0 = 0.0) {
  // get the size of the vectors
  int vecSize = a.length();

  // initialize a numeric vector "v" (for the result)
  NumericVector v(vecSize);

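  // nothing to compute for empty input (avoids reading a[0] out of bounds)
  if (vecSize == 0) return v;

  // compute the first element from the initial value v0
  v[0] = a[0] + b[0] * v0;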

  // loop through the vector and compute the new value
  for (int i = 1; i < vecSize; ++i) {
    v[i] = a[i] + b[i] * v[i - 1];
  }
  return v;
}

To source and use the function in R you can do:

Rcpp::sourceCpp("hydrofun.cpp")

library(data.table)
DT <- data.table(a = 1:4, 
                 b = 0.1,
                 v.ans = c(1, 2.1, 3.21, 4.321))

DT[, v_ans2 := HydroFun(a, b, 0)]
DT
#    a   b v.ans v_ans2
# 1: 1 0.1 1.000  1.000
# 2: 2 0.1 2.100  2.100
# 3: 3 0.1 3.210  3.210
# 4: 4 0.1 4.321  4.321

This gives the result you are looking for (at least in terms of the values).

Comparing the speeds reveals a speed-up of roughly 60x (comparing the median timings below).

library(microbenchmark)
n <- 10000
dt <- data.table(a = 1:n,
                 b = rnorm(n))

microbenchmark(dt[, v1 := vcalc(a, b, 0)],
               dt[, v2 := HydroFun(a, b, 0)])
# Unit: microseconds
#                               expr       min        lq       mean    median         uq       max neval
#     dt[, `:=`(v1, vcalc(a, b, 0))] 28369.672 30203.398 31883.9872 31651.566 32646.8780 68727.433   100
#  dt[, `:=`(v2, HydroFun(a, b, 0))]   381.307   421.697   512.2957   512.717   560.8585  1496.297   100

identical(dt$v1, dt$v2)
# [1] TRUE

Does that help you in any way? 这对你有什么帮助吗？

Answer 2

I think Reduce together with accumulate = TRUE is a commonly used technique for these types of calculations (see e.g. "recursively using the output as an input for a function"). It is not necessarily faster than a well-written loop*, and I don't know how data.table-esque you believe it is, but I still want to suggest it for your toolbox.

DT[ , v := 0][
  , v := Reduce(f = function(v, i) a[i] + b[i] * v, x = .I[-1], init = a[1], accumulate = TRUE)]

DT
#    a   b v.ans     v
# 1: 1 0.1 1.000 1.000
# 2: 2 0.1 2.100 2.100
# 3: 3 0.1 3.210 3.210
# 4: 4 0.1 4.321 4.321

Explanation:

Set the initial value of v to 0 (v := 0). Use Reduce to apply function f on an integer vector of row numbers, excluding the first row (x = .I[-1]). Instead, a[1] is placed at the start of x (init = a[1]); this equals the first result v[1] when v0 = 0. Reduce then "successively applies f to the elements [...] from left to right". The successive reduce combinations are "accumulated" (accumulate = TRUE).
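
The same mechanics can be traced outside data.table with plain vectors. This is a minimal sketch under the assumption that a = 1..4, b = 0.1 and v0 = 0; the names f, v0 and steps are chosen for illustration. It also writes init as a[1] + b[1] * v0, which reduces to a[1] when v0 is 0 and would be one way to allow a nonzero starting value.

```r
a  <- c(1, 2, 3, 4)     # illustrative parameter values
b  <- rep(0.1, 4)
v0 <- 0

# f receives the running value v and the next row index i.
f <- function(v, i) a[i] + b[i] * v

# x supplies rows 2..4; init is the first result v[1].
steps <- Reduce(f, x = seq_along(a)[-1],
                init = a[1] + b[1] * v0,
                accumulate = TRUE)
print(steps)
```

With accumulate = TRUE, Reduce returns init followed by every intermediate result, so the output has one element per row.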

*See e.g. here, where you can also read more about Reduce in this section.
