
R: dplyr solution for for-loop with initial conditions set

I have a data set covering 40 days of the year, with some variables:

set.seed(123)
df <- data.frame(day = 1:40,rain = runif(40,min = 0, max = 3), petc = runif(40, min = 0.3, max = 8),swc = runif(40, min = 27.01, max = 117.43))

I want to calculate another variable, aetc, for each day. It is calculated as follows:

SW.ini <- 2 # setting some initial values 
SW.max <- 5
SW.min <- 0

For day 1:

1) Determine a variable called PAW(day1) = SW.ini + rain(day1)

2) If PAW(day1) >= SWC(day1), aetc(day1) = petc(day1);
   if PAW(day1) < SWC(day1), aetc(day1) = PAW(day1)/SWC(day1) * petc(day1)

3) Check if aetc(day1) > PAW(day1). If yes, aetc(day1) = PAW(day1)

4) Update SW(day1) = SW.ini + rain(day1) - aetc(day1)

5) If SW(day1) > SW.max, SW(day1) = SW.max. Similarly, if SW(day1) < SW.min, SW(day1) = SW.min

Repeat for day 2:

1) Determine PAW(day2) = SW(day1) + rain(day2)

2) If PAW(day2) >= SWC(day2), aetc(day2) = petc(day2);
   if PAW(day2) < SWC(day2), aetc(day2) = PAW(day2)/SWC(day2) * petc(day2)

3) Check if aetc(day2) > PAW(day2). If yes, aetc(day2) = PAW(day2)

4) Update SW(day2) = SW(day1) + rain(day2) - aetc(day2)

5) If SW(day2) > SW.max, SW(day2) = SW.max. Similarly, if SW(day2) < SW.min, SW(day2) = SW.min
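The five steps are the same for every day; only the starting soil water changes (SW.ini on day 1, the previous day's SW afterwards). As a minimal sketch, one day's update could be written as a single function. The name `step_day` and its argument names are illustrative and not part of the original post; the input values in the example are made up.

```r
# One day of the water-balance update described in steps 1-5.
# `step_day` and its argument names are hypothetical, chosen for illustration.
step_day <- function(sw_prev, rain, petc, swc, sw_max = 5, sw_min = 0) {
  paw  <- sw_prev + rain                                   # step 1
  aetc <- if (paw >= swc) petc else paw / swc * petc       # step 2
  aetc <- min(aetc, paw)                                   # step 3
  sw   <- min(max(sw_prev + rain - aetc, sw_min), sw_max)  # steps 4 and 5
  c(PAW = paw, aetc = aetc, SW = sw)
}

# Day 1 starts from SW.ini = 2; day 2 starts from day 1's SW.
d1 <- step_day(sw_prev = 2, rain = 1, petc = 4, swc = 30)
d2 <- step_day(sw_prev = d1[["SW"]], rain = 0.5, petc = 6, swc = 60)
```

Because each day needs the previous day's SW, the calculation cannot be vectorised column-wise in the usual way, which is why a loop (or some form of fold) is needed.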

Here's my elegant for loop to do this:

      df$PAW <- NA
      df$aetc <- NA
      df$SW <- NA

      df$PAW[1] <- SW.ini + df$rain[1]

      df$aetc[1] <- ifelse(df$PAW[1] >= df$swc[1], df$petc[1],(df$PAW[1]/df$swc[1])*df$petc[1])
      df$aetc[1] <- ifelse(df$aetc[1] > df$PAW[1], df$PAW[1], df$aetc[1])
      df$SW[1] <- SW.ini + df$rain[1] -  df$aetc[1]
      df$SW[1] <- ifelse(df$SW[1] > SW.max, SW.max, ifelse(df$SW[1] < 0, 0,df$SW[1]))

      for (day in 2:nrow(df)){

        df$PAW[day] <- df$SW[day - 1] + df$rain[day]
        df$aetc[day] <- ifelse(df$PAW[day] >= df$swc[day], df$petc[day], (df$PAW[day]/df$swc[day]) * df$petc[day])
        df$aetc[day] <- ifelse(df$aetc[day] > df$PAW[day], df$PAW[day],df$aetc[day])
        df$SW[day] <- df$SW[day - 1] + df$rain[day] -  df$aetc[day]
        df$SW[day] <- ifelse(df$SW[day] > SW.max,SW.max, ifelse(df$SW[day] < 0, 0,df$SW[day]))
      }

My problem is that this is just one year of data, and I want to run it for multiple years.

      set.seed(123)
      df <- data.frame(year = 1980:2015, day = rep(1:40, each = 36),
                       rain = runif(40*36, min = 0, max = 3),
                       petc = runif(40*36, min = 0.3, max = 8),
                       swc = runif(40*36, min = 27.01, max = 117.43))

So I wanted to do something like:

                df %>% group_by(year) # and then run the above function for each year. 

Is there a dplyr or any other solution to this?

Thank you

Note: I originally posted this answer on your follow-up question, "R: for loop within a foreach loop", but after seeing this one, it seems this answer is far more relevant here. (I don't address anything related to parallelizing in my answer, which was the topic of your follow-up.)

Using Rcpp and data.table

Compiling the logic with C++ and applying it by group using data.table grouping operations gives a ~2,000x speed-up from your baseline, far greater than you might hope to get by parallelizing.

On your original example, which had 39,420,000 rows, this executes on my machine in 1.883 seconds; on the revised one with 28,800 rows, it executes in 0.004 seconds.

library(data.table)
library(Rcpp)

Define and compile a C++ function, CalcSW(), inline in the R script:

One note: counting in C/C++ starts at 0, unlike R, which starts at 1. That's why the indices are different here.

Rcpp::cppFunction('
List CalcSW(NumericVector SW_ini,
            NumericVector SW_max,
            NumericVector rain,
            NumericVector swc,
            NumericVector PETc) {

  int n = SW_ini.length();
  NumericVector SW(n);
  NumericVector PAW(n);
  NumericVector aetc(n);

  double SW_ini_glob = SW_ini[0];
  double SW_max_glob = SW_max[0];

  SW[0] = SW_ini_glob;
  PAW[0] = SW[0] + rain[0];

  if (PAW[0] > swc[0]){
    aetc[0] = PETc[0];
  } else {
    aetc[0] = PAW[0]/swc[0]*PETc[0];
  }

  if (aetc[0] > PAW[0]){
    aetc[0] = PAW[0];
  }

  SW[0] = SW[0] + rain[0] - aetc[0];

  if(SW[0] > SW_max_glob){
    SW[0] = SW_max_glob;
  }

  if(SW[0] < 0){
    SW[0] = 0;
  }

  for (int i = 1; i < n; i++) {

    PAW[i] = SW[i-1] + rain[i];  // day i's own rainfall

    if (PAW[i] > swc[i]){
      aetc[i] = PETc[i];
    } else {
      aetc[i] = PAW[i]/swc[i]*PETc[i];
    }

    if (aetc[i] > PAW[i]){
      aetc[i] = PAW[i];
    }

    SW[i] = SW[i-1] + rain[i] - aetc[i];

    if(SW[i] > SW_max_glob){
      SW[i] = SW_max_glob;
    }

    if(SW[i] < 0){
      SW[i] = 0;
    }
  }
  return Rcpp::List::create(Rcpp::Named("SW") = SW,
                            Rcpp::Named("PAW") = PAW,
                            Rcpp::Named("aetc") = aetc);
}')

Create the data.table:

df <- data.table(loc.id = rep(1:10, each = 80*36), 
                 year = rep(rep(1980:2015, each = 80), times = 10),
                 day = rep(rep(1:80, times = 36),times = 10),
                 rain = runif(10*36*80, min = 0 , max = 5),
                 swc = runif(10*36*80,min = 0, max = 50),
                 SW_max = rep(runif(10, min = 100, max = 200), each = 80*36),
                 SW_ini = runif(10*36*80),
                 PETc = runif(10*36*80, min = 0 , max = 1.3),
                 SW = as.numeric(NA),
                 PAW = as.numeric(NA), 
                 aetc = as.numeric(NA))

setkey(df, loc.id, year, day)

Execute the function CalcSW() on df for each combination of loc.id and year, assigning the returned values to the three columns simultaneously:

system.time({
  df[,  c("SW","PAW","aetc") := CalcSW(SW_ini,
                                       SW_max,
                                       rain,
                                       swc,
                                       PETc), keyby = .(loc.id, year)]
})


   user  system elapsed 
  0.004   0.000   0.004 

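Results (the output below was produced with the PAW line inside the loop reading rain[0]; with rain[i], values from the second row of each group onward will differ):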

head(df)


   loc.id year day       rain       swc   SW_max     SW_ini      PETc       SW      PAW       aetc
1:      1 1980   1 0.35813251 28.360715 177.3943 0.69116310 0.2870478 1.038675 1.049296 0.01062025
2:      1 1980   2 1.10331116 37.013022 177.3943 0.02742273 0.4412420 2.125335 1.396808 0.01665171
3:      1 1980   3 1.76680011 32.509970 177.3943 0.66273062 1.1071233 3.807561 2.483467 0.08457420
4:      1 1980   4 3.20966558  8.252797 177.3943 0.12220454 0.3496968 6.840713 4.165693 0.17651342
5:      1 1980   5 1.32498191 14.784203 177.3943 0.66381497 1.2168838 7.573160 7.198845 0.59253503
6:      1 1980   6 0.02547458 47.903637 177.3943 0.21871598 1.0864713 7.418750 7.931292 0.17988449

I'm not 100% positive I implemented your logic perfectly, but it should be straightforward to tweak wherever I may have missed something; I implemented it in much the same way you laid it out.
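One way to check the compiled function is to compare it with a plain R version on a single (loc.id, year) group. The sketch below is a hypothetical helper, `calc_sw_ref`, that takes the same five arguments as CalcSW() and follows the steps in the question, assuming (as CalcSW() does) a lower bound of 0, an upper bound of SW_max[1], and a start value of SW_ini[1]. The small input vectors are made up for illustration.

```r
# Plain-R reference for one (loc.id, year) group; hypothetical helper for checking.
calc_sw_ref <- function(SW_ini, SW_max, rain, swc, PETc) {
  n <- length(rain)
  SW <- PAW <- aetc <- numeric(n)
  sw_prev <- SW_ini[1]            # only the first value of the group is used
  for (i in seq_len(n)) {
    PAW[i]  <- sw_prev + rain[i]
    aetc[i] <- if (PAW[i] >= swc[i]) PETc[i] else PAW[i] / swc[i] * PETc[i]
    aetc[i] <- min(aetc[i], PAW[i])
    SW[i]   <- min(max(sw_prev + rain[i] - aetc[i], 0), SW_max[1])
    sw_prev <- SW[i]
  }
  list(SW = SW, PAW = PAW, aetc = aetc)
}

ref <- calc_sw_ref(SW_ini = c(2, 2, 2), SW_max = c(5, 5, 5),
                   rain = c(1, 4, 0), swc = c(30, 4, 10), PETc = c(4, 3, 2))
```

If CalcSW() has been compiled in the session, calling it with the same arguments and passing both results to all.equal() is one way to see whether the two implementations agree.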


One other note: it's much easier to write C++ with the auto-indenting and code highlighting you get (whether you're using RStudio or Emacs) if you create a separate file, named something like TestCode.cpp, formatted like below.

Then you can either use Rcpp::sourceCpp("TestCode.cpp") to compile your function in your R script, or copy and paste everything except the first three lines as a character string into the argument of Rcpp::cppFunction(), like I did above.

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
List CalcSW(NumericVector SW_ini,
                     NumericVector SW_max,
                     NumericVector rain,
                     NumericVector swc,
                     NumericVector PETc) {

  int n = SW_ini.length();
  NumericVector SW(n);
  NumericVector PAW(n);
  NumericVector aetc(n);

  double SW_ini_glob = SW_ini[0];
  double SW_max_glob = SW_max[0];

  SW[0] = SW_ini_glob;
  PAW[0] = SW[0] + rain[0];

  if (PAW[0] > swc[0]){
    aetc[0] = PETc[0];
  } else {
    aetc[0] = PAW[0]/swc[0]*PETc[0];
  }

  if (aetc[0] > PAW[0]){
    aetc[0] = PAW[0];
  }

  SW[0] = SW[0] + rain[0] - aetc[0];

  if(SW[0] > SW_max_glob){
    SW[0] = SW_max_glob;
  }

  if(SW[0] < 0){
    SW[0] = 0;
  }

  for (int i = 1; i < n; i++) {

    PAW[i] = SW[i-1] + rain[i];  // day i's own rainfall

    if (PAW[i] > swc[i]){
      aetc[i] = PETc[i];
    } else {
      aetc[i] = PAW[i]/swc[i]*PETc[i];
    }

    if (aetc[i] > PAW[i]){
      aetc[i] = PAW[i];
    }

    SW[i] = SW[i-1] + rain[i] - aetc[i];

    if(SW[i] > SW_max_glob){
      SW[i] = SW_max_glob;
    }

    if(SW[i] < 0){
      SW[i] = 0;
    }
  }
  return Rcpp::List::create(Rcpp::Named("SW") = SW,
                            Rcpp::Named("PAW") = PAW,
                            Rcpp::Named("aetc") = aetc);
}

You could wrap your code in another for loop and save each year's df in a list:

library(tidyverse)
lst <- vector("list", length(unique(df$year)))
for (i in seq_along(unique(df$year))) {
    df_year <- df %>% filter(year == unique(df$year)[[i]])

    # rest of code with df_year replacing df

    lst[[i]] <- df_year
}
final_df <- bind_rows(lst)
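The same split-apply-combine pattern can also be sketched in base R, without the explicit outer loop. In the example below, `run_year` is a hypothetical stand-in that compresses the question's per-year loop into one function (assuming SW.ini = 2 and bounds of 0 and 5), and `toy` is a small made-up data set with three years of four days each.

```r
# Stand-in for the per-year loop: any function that takes and returns a data frame works.
run_year <- function(d) {
  d <- d[order(d$day), ]
  sw_prev <- 2                                  # SW.ini from the question
  d$SW <- NA_real_
  for (i in seq_len(nrow(d))) {
    paw  <- sw_prev + d$rain[i]
    aetc <- min(if (paw >= d$swc[i]) d$petc[i] else paw / d$swc[i] * d$petc[i], paw)
    sw_prev <- min(max(paw - aetc, 0), 5)       # clamp to [SW.min, SW.max]
    d$SW[i] <- sw_prev
  }
  d
}

set.seed(123)
toy <- data.frame(year = rep(1980:1982, each = 4), day = rep(1:4, times = 3),
                  rain = runif(12, 0, 3), petc = runif(12, 0.3, 8),
                  swc = runif(12, 27.01, 117.43))

# split() makes one data frame per year, lapply() runs the function on each,
# and do.call(rbind, ...) stacks the pieces back together.
final_toy <- do.call(rbind, lapply(split(toy, toy$year), run_year))
```

Writing the yearly calculation as a function first also makes it easy to hand the same function to other grouping tools later.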

Matt's data.table example is a very good illustration of how fast data.table can be, because it does the calculations in place without copying or moving data around.

However, to answer the crux of your question about using pipes: you can use group_by along with do to accomplish what you are after (albeit much more slowly than data.table).

Below I set up the same dummy data Matt did. Then I use your function (but with the case fixed on PETc). It's not fast, but it's pretty easy to follow.

df <- data.frame(loc.id = rep(1:10, each = 80*36), 
                 year = rep(rep(1980:2015, each = 80), times = 10),
                 day = rep(rep(1:80, times = 36),times = 10),
                 rain = runif(10*36*80, min = 0 , max = 5),
                 swc = runif(10*36*80,min = 0, max = 50),
                 SW_max = rep(runif(10, min = 100, max = 200), each = 80*36),
                 SW_ini = runif(10*36*80),
                 PETc = runif(10*36*80, min = 0 , max = 1.3) 
                 )

my_fun <- function(df){
  SW.ini <- 2 # setting some initial values 
  SW.max <- 5
  SW.min <- 0

  df$PAW <- NA
  df$aetc <- NA
  df$SW <- NA

  df$PAW[1] <- SW.ini + df$rain[1]

  df$aetc[1] <- ifelse(df$PAW[1] >= df$swc[1], df$PETc[1],(df$PAW[1]/df$swc[1])*df$PETc[1])
  df$aetc[1] <- ifelse(df$aetc[1] > df$PAW[1], df$PAW[1], df$aetc[1])
  df$SW[1] <- SW.ini + df$rain[1] -  df$aetc[1]
  df$SW[1] <- ifelse(df$SW[1] > SW.max, SW.max, ifelse(df$SW[1] < 0, 0,df$SW[1]))

  for (day in 2:nrow(df)){

    df$PAW[day] <- df$SW[day - 1] + df$rain[day]
    df$aetc[day] <- ifelse(df$PAW[day] >= df$swc[day], df$PETc[day], (df$PAW[day]/df$swc[day]) * df$PETc[day])
    df$aetc[day] <- ifelse(df$aetc[day] > df$PAW[day], df$PAW[day],df$aetc[day])
    df$SW[day] <- df$SW[day - 1] + df$rain[day] -  df$aetc[day]
    df$SW[day] <- ifelse(df$SW[day] > SW.max,SW.max, ifelse(df$SW[day] < 0, 0,df$SW[day]))
  }
  return(df)
}


library(tictoc)
library(tidyverse)


tic()
df  %>% 
  group_by(year) %>%   # add loc.id here to match the (loc.id, year) grouping used with data.table
  do(my_fun(.)) -> 
  out
toc()
#> 5.075 sec elapsed
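If the explicit inner for loop is what you want to avoid, the day-to-day carry-over of SW can also be expressed as a fold. The sketch below uses base R's Reduce() with accumulate = TRUE, which is a different technique from the loop above. `next_sw` and `add_sw` are hypothetical helpers, the defaults (SW.ini = 2, bounds 0 and 5) are taken from the question, and `toy` is made-up input.

```r
# Carry SW forward with Reduce(accumulate = TRUE); helper names are illustrative.
next_sw <- function(sw_prev, i, d, sw_max = 5, sw_min = 0) {
  paw  <- sw_prev + d$rain[i]
  aetc <- min(if (paw >= d$swc[i]) d$PETc[i] else paw / d$swc[i] * d$PETc[i], paw)
  min(max(paw - aetc, sw_min), sw_max)
}

add_sw <- function(d, sw_ini = 2) {
  # sw holds sw_ini followed by SW for each row, so it has nrow(d) + 1 elements
  sw <- unlist(Reduce(function(s, i) next_sw(s, i, d), seq_len(nrow(d)),
                      init = sw_ini, accumulate = TRUE))
  d$PAW  <- head(sw, -1) + d$rain   # previous SW plus today's rain
  d$aetc <- pmin(ifelse(d$PAW >= d$swc, d$PETc, d$PAW / d$swc * d$PETc), d$PAW)
  d$SW   <- sw[-1]
  d
}

toy <- data.frame(rain = c(1, 4, 0), swc = c(30, 4, 10), PETc = c(4, 3, 2))
out <- add_sw(toy)
```

Since add_sw() takes and returns a data frame, it should be usable in place of my_fun inside do(). It is still a sequential calculation, so it is unlikely to be faster than the loop.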
