R: A dplyr solution for a for loop with initial conditions

Question

I have a data frame that covers 40 days of a year, with some variables for each day:

set.seed(123)
df <- data.frame(day = 1:40,rain = runif(40,min = 0, max = 3), petc = runif(40, min = 0.3, max = 8),swc = runif(40, min = 27.01, max = 117.43))

For each day I want to calculate another variable, aetc, as follows:

SW.ini <- 2 # setting some initial values 
SW.max <- 5
SW.min <- 0

For day 1:

1) Determine a variable called PAW(day1) = SW.ini + rain(day1)

2) If PAW(day1) >= SWC(day1), aetc(day1) = petc(day1);
   if PAW(day1) < SWC(day1), aetc(day1) = PAW(day1)/SWC(day1) * petc(day1)

3) Check if aetc(day1) > PAW(day1). If yes, aetc(day1) = PAW(day1)

4) Update SW(day1) = SW.ini + rain(day1) - aetc(day1)

5) If SW(day1) > SW.max, SW(day1) = SW.max. Similarly, if SW(day1) < SW.min, SW(day1) = SW.min

Repeat for day 2:

1) Determine PAW(day2) = SW(day1) + rain(day2)

2) If PAW(day2) >= SWC(day2), aetc(day2) = petc(day2);
   if PAW(day2) < SWC(day2), aetc(day2) = PAW(day2)/SWC(day2) * petc(day2)

3) Check if aetc(day2) > PAW(day2). If yes, aetc(day2) = PAW(day2)

4) Update SW(day2) = SW(day1) + rain(day2) - aetc(day2)

5) If SW(day2) > SW.max, SW(day2) = SW.max. Similarly, if SW(day2) < SW.min, SW(day2) = SW.min
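
In code terms, the update for a single day might be sketched as below. The function name step_day and its argument names are made up for illustration; the defaults mirror SW.max and SW.min above, and the input numbers are arbitrary.

```r
# Hypothetical helper: one day of the water balance described in steps 1-5.
step_day <- function(sw_prev, rain, petc, swc, sw_max = 5, sw_min = 0) {
  paw <- sw_prev + rain                                  # step 1
  aetc <- if (paw >= swc) petc else paw / swc * petc     # step 2
  aetc <- min(aetc, paw)                                 # step 3
  sw <- min(max(sw_prev + rain - aetc, sw_min), sw_max)  # steps 4 and 5
  c(PAW = paw, aetc = aetc, SW = sw)
}

# Day 1 starts from SW.ini (2 here); later days would pass the previous day's SW.
day1 <- step_day(sw_prev = 2, rain = 1, petc = 4, swc = 50)
print(day1)
```

Because each day needs the previous day's SW, the calculation cannot be vectorised over days directly, which is why a loop (or a compiled loop) is needed within each year.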

Here's my elegant for loop to do this:

      df$PAW <- NA
      df$aetc <- NA
      df$SW <- NA

      df$PAW[1] <- SW.ini + df$rain[1]

      df$aetc[1] <- ifelse(df$PAW[1] >= df$swc[1], df$petc[1],(df$PAW[1]/df$swc[1])*df$petc[1])
      df$aetc[1] <- ifelse(df$aetc[1] > df$PAW[1], df$PAW[1], df$aetc[1])
      df$SW[1] <- SW.ini + df$rain[1] -  df$aetc[1]
      df$SW[1] <- ifelse(df$SW[1] > SW.max, SW.max, ifelse(df$SW[1] < 0, 0,df$SW[1]))

      for (day in 2:nrow(df)){

        df$PAW[day] <- df$SW[day - 1] + df$rain[day]
        df$aetc[day] <- ifelse(df$PAW[day] >= df$swc[day], df$petc[day], (df$PAW[day]/df$swc[day]) * df$petc[day])
        df$aetc[day] <- ifelse(df$aetc[day] > df$PAW[day], df$PAW[day],df$aetc[day])
        df$SW[day] <- df$SW[day - 1] + df$rain[day] -  df$aetc[day]
        df$SW[day] <- ifelse(df$SW[day] > SW.max,SW.max, ifelse(df$SW[day] < 0, 0,df$SW[day]))
      }

My problem is that this is just one year of data, and I want to run it for multiple years:

      set.seed(123)
      df <- data.frame(year = 1980:2015, day = rep(1:40, each = 36),rain = 
      runif(40*36,min = 0, max = 3), petc = runif(40*36, min = 0.3, max = 8),swc = runif(40*36, min = 27.01, max = 117.43))

So I wanted to do something like:

                df %>% group_by(year) # and then run the above function for each year.

Is there a dplyr or any other solution to this?

Thank you

Answer 1

Note: I originally posted this answer on your follow-up question, "R: for loop within a foreach loop", but after seeing this one, it seems far more relevant here. (I don't address anything related to parallelizing in my answer, which was the topic of your follow-up.)

Using `Rcpp` and `data.table`

Compiling the logic with C++ and applying it by group using data.table grouping operations gives a ~2,000x speed-up from your baseline, far greater than you might hope to get by parallelizing.

On your original example, which had 39,420,000 rows, this executes on my machine in 1.883 seconds; on the revised one with 28,800 rows, it executes in 0.004 seconds.

library(data.table)
library(Rcpp)

Define and compile a C++ function, CalcSW(), inline in the R script.

One note: indexing in C/C++ starts at 0, unlike R, where it starts at 1. That is why the indices are different here.

Rcpp::cppFunction('
List CalcSW(NumericVector SW_ini,
            NumericVector SW_max,
            NumericVector rain,
            NumericVector swc,
            NumericVector PETc) {

  int n = SW_ini.length();
  NumericVector SW(n);
  NumericVector PAW(n);
  NumericVector aetc(n);

  double SW_ini_glob = SW_ini[0];
  double SW_max_glob = SW_max[0];

  SW[0] = SW_ini_glob;
  PAW[0] = SW[0] + rain[0];

  if (PAW[0] > swc[0]){
    aetc[0] = PETc[0];
  } else {
    aetc[0] = PAW[0]/swc[0]*PETc[0];
  }

  if (aetc[0] > PAW[0]){
    aetc[0] = PAW[0];
  }

  SW[0] = SW[0] + rain[0] - aetc[0];

  if(SW[0] > SW_max_glob){
    SW[0] = SW_max_glob;
  }

  if(SW[0] < 0){
    SW[0] = 0;
  }

  for (int i = 1; i < n; i++) {

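    // Use the current day rain; the original post had rain[0] here,
    // which is what the printed sample output reflects.
    PAW[i] = SW[i-1] + rain[i];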

    if (PAW[i] > swc[i]){
      aetc[i] = PETc[i];
    } else {
      aetc[i] = PAW[i]/swc[i]*PETc[i];
    }

    if (aetc[i] > PAW[i]){
      aetc[i] = PAW[i];
    }

    SW[i] = SW[i-1] + rain[i] - aetc[i];

    if(SW[i] > SW_max_glob){
      SW[i] = SW_max_glob;
    }

    if(SW[i] < 0){
     SW[i] = 0;
    }
  }
  return Rcpp::List::create(Rcpp::Named("SW") = SW,
                            Rcpp::Named("PAW") = PAW,
                            Rcpp::Named("aetc") = aetc);
}')

Create the data.table:

df <- data.table(loc.id = rep(1:10, each = 80*36), 
                 year = rep(rep(1980:2015, each = 80), times = 10),
                 day = rep(rep(1:80, times = 36),times = 10),
                 rain = runif(10*36*80, min = 0 , max = 5),
                 swc = runif(10*36*80,min = 0, max = 50),
                 SW_max = rep(runif(10, min = 100, max = 200), each = 80*36),
                 SW_ini = runif(10*36*80),
                 PETc = runif(10*36*80, min = 0 , max = 1.3),
                 SW = as.numeric(NA),
                 PAW = as.numeric(NA), 
                 aetc = as.numeric(NA))

setkey(df, loc.id, year, day)

Execute CalcSW() on df for each combination of loc.id and year, assigning the returned values to the three columns simultaneously:

system.time({
  df[,  c("SW","PAW","aetc") := CalcSW(SW_ini,
                                       SW_max,
                                       rain,
                                       swc,
                                       PETc), keyby = .(loc.id, year)]
})

... ...

   user  system elapsed 
  0.004   0.000   0.004

Results:

head(df)

... ...

   loc.id year day       rain       swc   SW_max     SW_ini      PETc       SW      PAW       aetc
1:      1 1980   1 0.35813251 28.360715 177.3943 0.69116310 0.2870478 1.038675 1.049296 0.01062025
2:      1 1980   2 1.10331116 37.013022 177.3943 0.02742273 0.4412420 2.125335 1.396808 0.01665171
3:      1 1980   3 1.76680011 32.509970 177.3943 0.66273062 1.1071233 3.807561 2.483467 0.08457420
4:      1 1980   4 3.20966558  8.252797 177.3943 0.12220454 0.3496968 6.840713 4.165693 0.17651342
5:      1 1980   5 1.32498191 14.784203 177.3943 0.66381497 1.2168838 7.573160 7.198845 0.59253503
6:      1 1980   6 0.02547458 47.903637 177.3943 0.21871598 1.0864713 7.418750 7.931292 0.17988449

I'm not 100% positive I implemented your logic perfectly, but it should be straightforward to tweak wherever I may have missed something; I implemented it in a very similar manner to how you laid it out.
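
One way to check the compiled function is to compare it against a plain-R reference on a small input. The sketch below is only that: calc_sw_ref is a hypothetical name, the test vectors are made up, and it follows the steps in the question (current day's rain, lower bound of 0, only the first elements of SW_ini and SW_max used, as in CalcSW()).

```r
# Plain-R reference with the same arguments and return shape as CalcSW().
# calc_sw_ref is a hypothetical name used only for this check.
calc_sw_ref <- function(SW_ini, SW_max, rain, swc, PETc) {
  n <- length(rain)
  SW <- PAW <- aetc <- numeric(n)
  sw_prev <- SW_ini[1]   # only the first element is used
  sw_max <- SW_max[1]
  for (i in seq_len(n)) {
    PAW[i] <- sw_prev + rain[i]
    aetc[i] <- if (PAW[i] >= swc[i]) PETc[i] else PAW[i] / swc[i] * PETc[i]
    aetc[i] <- min(aetc[i], PAW[i])
    SW[i] <- min(max(sw_prev + rain[i] - aetc[i], 0), sw_max)
    sw_prev <- SW[i]
  }
  list(SW = SW, PAW = PAW, aetc = aetc)
}

ref <- calc_sw_ref(SW_ini = rep(2, 3), SW_max = rep(5, 3),
                   rain = c(1, 0, 4), swc = c(50, 50, 2), PETc = c(4, 4, 1))
str(ref)

# With CalcSW() compiled, the two can be compared on the same inputs:
# all.equal(CalcSW(rep(2, 3), rep(5, 3), c(1, 0, 4), c(50, 50, 2), c(4, 4, 1)), ref)
```

Any difference reported by all.equal() points at a place where the C++ deviates from the steps in the question.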

One other note: it's much easier to write C++ with the auto-indenting and code highlighting you get (whether you're using RStudio or Emacs) if you create a separate file, named something like TestCode.cpp, formatted like below.

Then you can either use Rcpp::sourceCpp("TestCode.cpp") to compile your function in your R script, or copy and paste everything except the first three lines as a character string into Rcpp::cppFunction(), like I did above.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List CalcSW(NumericVector SW_ini,
                     NumericVector SW_max,
                     NumericVector rain,
                     NumericVector swc,
                     NumericVector PETc) {

  int n = SW_ini.length();
  NumericVector SW(n);
  NumericVector PAW(n);
  NumericVector aetc(n);

  double SW_ini_glob = SW_ini[0];
  double SW_max_glob = SW_max[0];

  SW[0] = SW_ini_glob;
  PAW[0] = SW[0] + rain[0];

  if (PAW[0] > swc[0]){
    aetc[0] = PETc[0];
  } else {
    aetc[0] = PAW[0]/swc[0]*PETc[0];
  }

  if (aetc[0] > PAW[0]){
    aetc[0] = PAW[0];
  }

  SW[0] = SW[0] + rain[0] - aetc[0];

  if(SW[0] > SW_max_glob){
    SW[0] = SW_max_glob;
  }

  if(SW[0] < 0){
    SW[0] = 0;
  }

  for (int i = 1; i < n; i++) {

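    // Use the current day rain; the original post had rain[0] here,
    // which is what the printed sample output reflects.
    PAW[i] = SW[i-1] + rain[i];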

    if (PAW[i] > swc[i]){
      aetc[i] = PETc[i];
    } else {
      aetc[i] = PAW[i]/swc[i]*PETc[i];
    }

    if (aetc[i] > PAW[i]){
      aetc[i] = PAW[i];
    }

    SW[i] = SW[i-1] + rain[i] - aetc[i];

    if(SW[i] > SW_max_glob){
      SW[i] = SW_max_glob;
    }

    if(SW[i] < 0){
      SW[i] = 0;
    }
  }
  return Rcpp::List::create(Rcpp::Named("SW") = SW,
                            Rcpp::Named("PAW") = PAW,
                            Rcpp::Named("aetc") = aetc);
}

Answer 2

You could wrap your code in another for loop and save each year's df in a list:

library(tidyverse)
lst <- vector("list", length(unique(df$year)))
for (i in seq_along(unique(df$year))) {
    df_year <- df %>% filter(year == unique(df$year)[[i]])

    # rest of code with df_year replacing df

    lst[[i]] <- df_year
}
final_df <- bind_rows(lst)
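
For reference, the same idea can be written out in full with base R only, using split() and lapply() in place of the explicit outer loop. This is a sketch rather than the code from the question: run_year is a hypothetical name, and it assumes the multi-year data from the question with SW.ini = 2, SW.max = 5 and SW.min = 0.

```r
set.seed(123)
df <- data.frame(year = 1980:2015, day = rep(1:40, each = 36),
                 rain = runif(40 * 36, min = 0, max = 3),
                 petc = runif(40 * 36, min = 0.3, max = 8),
                 swc  = runif(40 * 36, min = 27.01, max = 117.43))

# Hypothetical per-year function: runs the daily water balance for one year.
run_year <- function(d, SW.ini = 2, SW.max = 5, SW.min = 0) {
  d <- d[order(d$day), ]          # make sure days are in sequence
  n <- nrow(d)
  PAW <- aetc <- SW <- numeric(n)
  sw_prev <- SW.ini               # every year restarts from SW.ini
  for (i in seq_len(n)) {
    PAW[i] <- sw_prev + d$rain[i]
    aetc[i] <- if (PAW[i] >= d$swc[i]) d$petc[i] else PAW[i] / d$swc[i] * d$petc[i]
    aetc[i] <- min(aetc[i], PAW[i])
    SW[i] <- min(max(PAW[i] - aetc[i], SW.min), SW.max)
    sw_prev <- SW[i]
  }
  cbind(d, PAW = PAW, aetc = aetc, SW = SW)
}

# One data frame per year, processed independently, then stacked again.
final_df <- do.call(rbind, lapply(split(df, df$year), run_year))
rownames(final_df) <- NULL
head(final_df)
```

Working on plain vectors inside the loop and attaching the columns once at the end avoids repeatedly modifying the data frame row by row.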

Answer 3

The data.table illustration from Matt is a very good demonstration of how fast data.table can be, because it does the calculations in place without copying or moving data around.

However, to answer the crux of your question about using pipes, you can use group_by along with do to accomplish what you are after (albeit much more slowly than data.table).

Below I set up the same dummy data Matt did. Then I use your function (but with the case fixed on PETc). It's not fast, but it's pretty easy to follow.

df <- data.frame(loc.id = rep(1:10, each = 80*36), 
                 year = rep(rep(1980:2015, each = 80), times = 10),
                 day = rep(rep(1:80, times = 36),times = 10),
                 rain = runif(10*36*80, min = 0 , max = 5),
                 swc = runif(10*36*80,min = 0, max = 50),
                 SW_max = rep(runif(10, min = 100, max = 200), each = 80*36),
                 SW_ini = runif(10*36*80),
                 PETc = runif(10*36*80, min = 0 , max = 1.3) 
                 )

my_fun <- function(df){
  SW.ini <- 2 # setting some initial values 
  SW.max <- 5
  SW.min <- 0

  df$PAW <- NA
  df$aetc <- NA
  df$SW <- NA

  df$PAW[1] <- SW.ini + df$rain[1]

  df$aetc[1] <- ifelse(df$PAW[1] >= df$swc[1], df$PETc[1],(df$PAW[1]/df$swc[1])*df$PETc[1])
  df$aetc[1] <- ifelse(df$aetc[1] > df$PAW[1], df$PAW[1], df$aetc[1])
  df$SW[1] <- SW.ini + df$rain[1] -  df$aetc[1]
  df$SW[1] <- ifelse(df$SW[1] > SW.max, SW.max, ifelse(df$SW[1] < 0, 0,df$SW[1]))

  for (day in 2:nrow(df)){

    df$PAW[day] <- df$SW[day - 1] + df$rain[day]
    df$aetc[day] <- ifelse(df$PAW[day] >= df$swc[day], df$PETc[day], (df$PAW[day]/df$swc[day]) * df$PETc[day])
    df$aetc[day] <- ifelse(df$aetc[day] > df$PAW[day], df$PAW[day],df$aetc[day])
    df$SW[day] <- df$SW[day - 1] + df$rain[day] -  df$aetc[day]
    df$SW[day] <- ifelse(df$SW[day] > SW.max,SW.max, ifelse(df$SW[day] < 0, 0,df$SW[day]))
  }
  return(df)
}


library(tictoc)
library(tidyverse)


tic()
df  %>% 
  group_by(year) %>%
  do(my_fun(.)) -> 
  out
toc()
#> 5.075 sec elapsed
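
Since do() has been superseded in dplyr (as of version 1.0.0), a variant with group_modify() may be closer to current idiom. The sketch below assumes dplyr >= 0.8.1, uses a much smaller made-up data set, and groups by both loc.id and year so that every location-year series restarts from SW.ini; water_balance is a hypothetical name, not a function from the question.

```r
library(dplyr)

set.seed(1)
df <- data.frame(loc.id = rep(1:2, each = 3 * 5),
                 year   = rep(rep(2001:2003, each = 5), times = 2),
                 day    = rep(1:5, times = 6),
                 rain   = runif(30, min = 0, max = 5),
                 swc    = runif(30, min = 1, max = 50),
                 PETc   = runif(30, min = 0, max = 1.3))

# Hypothetical per-group function: takes one group's rows, returns them
# with PAW, aetc and SW added.
water_balance <- function(d, SW.ini = 2, SW.max = 5, SW.min = 0) {
  n <- nrow(d)
  PAW <- aetc <- SW <- numeric(n)
  sw_prev <- SW.ini
  for (i in seq_len(n)) {
    PAW[i] <- sw_prev + d$rain[i]
    aetc[i] <- if (PAW[i] >= d$swc[i]) d$PETc[i] else PAW[i] / d$swc[i] * d$PETc[i]
    aetc[i] <- min(aetc[i], PAW[i])
    SW[i] <- min(max(PAW[i] - aetc[i], SW.min), SW.max)
    sw_prev <- SW[i]
  }
  d$PAW <- PAW
  d$aetc <- aetc
  d$SW <- SW
  d
}

out <- df %>%
  arrange(loc.id, year, day) %>%
  group_by(loc.id, year) %>%
  group_modify(~ water_balance(.x)) %>%
  ungroup()

head(out)
```

group_modify() passes each group to the function without its grouping columns and adds them back afterwards, so the function must not return loc.id or year itself.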

Answer 1: 5, accepted, 2018-03-06 22:51:47
Answer 2: 1, 2018-03-05 21:13:06
Answer 3: 1, 2018-05-21 20:27:31
