R: dplyr solution for for-loop with initial conditions set
I have a data frame covering 40 days of a year, with a few variables per day:
set.seed(123)
df <- data.frame(day = 1:40,rain = runif(40,min = 0, max = 3), petc = runif(40, min = 0.3, max = 8),swc = runif(40, min = 27.01, max = 117.43))
I want to calculate another variable called aetc for each day, which is calculated as follows:
SW.ini <- 2 # setting some initial values
SW.max <- 5
SW.min <- 0
For day 1:
1) Determine a variable called PAW(day1) = SW.ini + rain(day1)
2) If PAW(day1) >= SWC(day1), aetc(day1) = petc(day1); if PAW(day1) < SWC(day1), aetc(day1) = PAW(day1)/SWC(day1) * petc(day1)
3) Check if aetc(day1) > PAW(day1). If yes, aetc(day1) = PAW(day1)
4) Update SW(day1) = SW.ini + rain(day1) - aetc(day1)
5) If SW(day1) > SW.max, SW(day1) = SW.max. Similarly, if SW(day1) < SW.min, SW(day1) = SW.min
Repeat for day 2:
1) Determine PAW(day2) = SW(day1) + rain(day2)
2) If PAW(day2) >= SWC(day2), aetc(day2) = petc(day2); if PAW(day2) < SWC(day2), aetc(day2) = PAW(day2)/SWC(day2) * petc(day2)
3) Check if aetc(day2) > PAW(day2). If yes, aetc(day2) = PAW(day2)
4) Update SW(day2) = SW(day1) + rain(day2) - aetc(day2)
5) If SW(day2) > SW.max, SW(day2) = SW.max. Similarly, if SW(day2) < SW.min, SW(day2) = SW.min
Here's my elegant for loop to do this:
df$PAW <- NA
df$aetc <- NA
df$SW <- NA
df$PAW[1] <- SW.ini + df$rain[1]
df$aetc[1] <- ifelse(df$PAW[1] >= df$swc[1], df$petc[1],(df$PAW[1]/df$swc[1])*df$petc[1])
df$aetc[1] <- ifelse(df$aetc[1] > df$PAW[1], df$PAW[1], df$aetc[1])
df$SW[1] <- SW.ini + df$rain[1] - df$aetc[1]
df$SW[1] <- ifelse(df$SW[1] > SW.max, SW.max, ifelse(df$SW[1] < 0, 0,df$SW[1]))
for (day in 2:nrow(df)){
  df$PAW[day] <- df$SW[day - 1] + df$rain[day]
  df$aetc[day] <- ifelse(df$PAW[day] >= df$swc[day], df$petc[day], (df$PAW[day]/df$swc[day]) * df$petc[day])
  df$aetc[day] <- ifelse(df$aetc[day] > df$PAW[day], df$PAW[day], df$aetc[day])
  df$SW[day] <- df$SW[day - 1] + df$rain[day] - df$aetc[day]
  df$SW[day] <- ifelse(df$SW[day] > SW.max, SW.max, ifelse(df$SW[day] < 0, 0, df$SW[day]))
}
My problem is that this is just one year of data, and I want to run it for multiple years.
set.seed(123)
df <- data.frame(year = 1980:2015, day = rep(1:40, each = 36),
                 rain = runif(40*36, min = 0, max = 3),
                 petc = runif(40*36, min = 0.3, max = 8),
                 swc = runif(40*36, min = 27.01, max = 117.43))
So I wanted to do something like:
df %>% group_by(year) # and then run the above function for each year.
Is there a dplyr or any other solution to this?
Thank you
Note: I originally posted this answer on your follow-up question, "R: for loop within a foreach loop", but after seeing this one, it seems this answer is far more relevant here. (I don't address anything related to parallelizing in my answer, which was the topic of your follow-up.)
Rcpp and data.table
Compiling the logic with C++ and applying it by group using data.table grouping operations gives a ~2,000x speed-up from your baseline, far greater than you might hope to get by parallelizing.
On your original example, which had 39,420,000 rows, this executes on my machine in 1.883 seconds; on the revised one with 28,800 rows, it executes in 0.004 seconds.
library(data.table)
library(Rcpp)
Define and compile a C++ function, CalcSW(), inline in the R script:
One note: counting in C/C++ starts at 0, unlike R, which starts at 1; that's why the indices are different here.
Rcpp::cppFunction('
List CalcSW(NumericVector SW_ini,
            NumericVector SW_max,
            NumericVector rain,
            NumericVector swc,
            NumericVector PETc) {

  int n = SW_ini.length();
  NumericVector SW(n);
  NumericVector PAW(n);
  NumericVector aetc(n);

  // Only the first element of SW_ini and SW_max is used for each group
  double SW_ini_glob = SW_ini[0];
  double SW_max_glob = SW_max[0];

  // Day 1 (index 0) starts from the initial soil water
  SW[0] = SW_ini_glob;
  PAW[0] = SW[0] + rain[0];
  if (PAW[0] > swc[0]) {
    aetc[0] = PETc[0];
  } else {
    aetc[0] = PAW[0] / swc[0] * PETc[0];
  }
  if (aetc[0] > PAW[0]) {
    aetc[0] = PAW[0];
  }
  SW[0] = SW[0] + rain[0] - aetc[0];
  if (SW[0] > SW_max_glob) {
    SW[0] = SW_max_glob;
  }
  if (SW[0] < 0) {
    SW[0] = 0;
  }

  // Remaining days carry over the soil water of the previous day
  for (int i = 1; i < n; i++) {
    // rain[i] is the rain of the current day; the version originally posted
    // read rain[0] here, and the sample output printed below came from it
    PAW[i] = SW[i-1] + rain[i];
    if (PAW[i] > swc[i]) {
      aetc[i] = PETc[i];
    } else {
      aetc[i] = PAW[i] / swc[i] * PETc[i];
    }
    if (aetc[i] > PAW[i]) {
      aetc[i] = PAW[i];
    }
    SW[i] = SW[i-1] + rain[i] - aetc[i];
    if (SW[i] > SW_max_glob) {
      SW[i] = SW_max_glob;
    }
    if (SW[i] < 0) {
      SW[i] = 0;
    }
  }

  return Rcpp::List::create(Rcpp::Named("SW") = SW,
                            Rcpp::Named("PAW") = PAW,
                            Rcpp::Named("aetc") = aetc);
}')
Create the data.table:
df <- data.table(loc.id = rep(1:10, each = 80*36),
                 year = rep(rep(1980:2015, each = 80), times = 10),
                 day = rep(rep(1:80, times = 36), times = 10),
                 rain = runif(10*36*80, min = 0, max = 5),
                 swc = runif(10*36*80, min = 0, max = 50),
                 SW_max = rep(runif(10, min = 100, max = 200), each = 80*36),
                 SW_ini = runif(10*36*80),
                 PETc = runif(10*36*80, min = 0, max = 1.3),
                 SW = as.numeric(NA),
                 PAW = as.numeric(NA),
                 aetc = as.numeric(NA))
setkey(df, loc.id, year, day)
Execute the function CalcSW() on the df for each combination of loc.id and year, and assign the returned values to the three columns simultaneously:
system.time({
  df[, c("SW", "PAW", "aetc") := CalcSW(SW_ini,
                                        SW_max,
                                        rain,
                                        swc,
                                        PETc), keyby = .(loc.id, year)]
})
...
user system elapsed
0.004 0.000 0.004
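Conceptually, the grouped call restarts the recursion at the first row of every (loc.id, year) group and carries the soil water forward within the group. As a rough illustration only, here is a plain C++ sketch of that idea without Rcpp or data.table. The function name run_by_group and the integer group vector (standing in for the loc.id/year key) are made up for this sketch, and it assumes the rows are already sorted by group, which is what setkey() arranges above. It also reads sw_ini and sw_max per row rather than taking the first element of each group.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: runs the soil-water recursion over rows that are
// already sorted by group, restarting from sw_ini[i] at each group boundary.
// `group` stands in for the (loc.id, year) key used in the data.table call.
std::vector<double> run_by_group(const std::vector<int>& group,
                                 const std::vector<double>& sw_ini,
                                 const std::vector<double>& sw_max,
                                 const std::vector<double>& rain,
                                 const std::vector<double>& swc,
                                 const std::vector<double>& petc) {
    const std::size_t n = group.size();
    std::vector<double> sw(n);
    for (std::size_t i = 0; i < n; ++i) {
        // The first row of a group starts from its initial value; later rows
        // carry over the soil water of the previous day.
        const bool first = (i == 0) || (group[i] != group[i - 1]);
        const double prev = first ? sw_ini[i] : sw[i - 1];

        const double paw = prev + rain[i];
        double aetc = (paw > swc[i]) ? petc[i] : paw / swc[i] * petc[i];
        if (aetc > paw) aetc = paw;

        // Update the soil water and keep it within [0, sw_max].
        double s = prev + rain[i] - aetc;
        if (s > sw_max[i]) s = sw_max[i];
        if (s < 0.0) s = 0.0;
        sw[i] = s;
    }
    return sw;
}
```

In the R version this bookkeeping is unnecessary, because data.table hands CalcSW() one group at a time; the sketch only makes the "restart per group" behaviour explicit.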
Results:
head(df)
...
   loc.id year day       rain       swc   SW_max     SW_ini      PETc       SW      PAW       aetc
1:      1 1980   1 0.35813251 28.360715 177.3943 0.69116310 0.2870478 1.038675 1.049296 0.01062025
2:      1 1980   2 1.10331116 37.013022 177.3943 0.02742273 0.4412420 2.125335 1.396808 0.01665171
3:      1 1980   3 1.76680011 32.509970 177.3943 0.66273062 1.1071233 3.807561 2.483467 0.08457420
4:      1 1980   4 3.20966558  8.252797 177.3943 0.12220454 0.3496968 6.840713 4.165693 0.17651342
5:      1 1980   5 1.32498191 14.784203 177.3943 0.66381497 1.2168838 7.573160 7.198845 0.59253503
6:      1 1980   6 0.02547458 47.903637 177.3943 0.21871598 1.0864713 7.418750 7.931292 0.17988449
I'm not 100% positive I implemented your logic perfectly, but it should be pretty straightforward to tweak wherever I may have missed something, since I implemented it in a very similar manner to how you laid it out.
One other note: it's way easier to write C++ with the auto-indenting and code highlighting you get (whether you're using RStudio or Emacs) if you create a separate file, named something like TestCode.cpp, formatted like below.
Then, you can either use Rcpp::sourceCpp("TestCode.cpp") to compile your function in your R script, or you can copy and paste everything except for the first three lines as a character string into an argument of Rcpp::cppFunction(), like I did above.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
List CalcSW(NumericVector SW_ini,
            NumericVector SW_max,
            NumericVector rain,
            NumericVector swc,
            NumericVector PETc) {

  int n = SW_ini.length();
  NumericVector SW(n);
  NumericVector PAW(n);
  NumericVector aetc(n);

  // Only the first element of SW_ini and SW_max is used for each group
  double SW_ini_glob = SW_ini[0];
  double SW_max_glob = SW_max[0];

  // Day 1 (index 0) starts from the initial soil water
  SW[0] = SW_ini_glob;
  PAW[0] = SW[0] + rain[0];
  if (PAW[0] > swc[0]) {
    aetc[0] = PETc[0];
  } else {
    aetc[0] = PAW[0] / swc[0] * PETc[0];
  }
  if (aetc[0] > PAW[0]) {
    aetc[0] = PAW[0];
  }
  SW[0] = SW[0] + rain[0] - aetc[0];
  if (SW[0] > SW_max_glob) {
    SW[0] = SW_max_glob;
  }
  if (SW[0] < 0) {
    SW[0] = 0;
  }

  // Remaining days carry over the soil water of the previous day
  for (int i = 1; i < n; i++) {
    // rain[i] is the rain of the current day (not rain[0])
    PAW[i] = SW[i-1] + rain[i];
    if (PAW[i] > swc[i]) {
      aetc[i] = PETc[i];
    } else {
      aetc[i] = PAW[i] / swc[i] * PETc[i];
    }
    if (aetc[i] > PAW[i]) {
      aetc[i] = PAW[i];
    }
    SW[i] = SW[i-1] + rain[i] - aetc[i];
    if (SW[i] > SW_max_glob) {
      SW[i] = SW_max_glob;
    }
    if (SW[i] < 0) {
      SW[i] = 0;
    }
  }

  return Rcpp::List::create(Rcpp::Named("SW") = SW,
                            Rcpp::Named("PAW") = PAW,
                            Rcpp::Named("aetc") = aetc);
}
You could wrap your code in another for loop and save each year's df in a list:
library(tidyverse)
lst <- vector("list", length(unique(df$year)))
for (i in seq_along(unique(df$year))) {
  df_year <- df %>% filter(year == unique(df$year)[[i]])
  # rest of code with df_year replacing df
  lst[[i]] <- df_year
}
final_df <- bind_rows(lst)
The data.table illustration from Matt is a very good illustration of how fast data.table can be, because it does the calculations in place, without copying or moving data around.
However, to answer the crux of your question about using pipes, you can use group_by along with do to accomplish what you are after (albeit much slower than data.table).
Below I set up the same dummy data Matt did. Then I use your function (but with the case fixed on PETc). It's not fast, but it's pretty easy to follow.
df <- data.frame(loc.id = rep(1:10, each = 80*36),
                 year = rep(rep(1980:2015, each = 80), times = 10),
                 day = rep(rep(1:80, times = 36), times = 10),
                 rain = runif(10*36*80, min = 0, max = 5),
                 swc = runif(10*36*80, min = 0, max = 50),
                 SW_max = rep(runif(10, min = 100, max = 200), each = 80*36),
                 SW_ini = runif(10*36*80),
                 PETc = runif(10*36*80, min = 0, max = 1.3)
)
my_fun <- function(df){
  SW.ini <- 2 # setting some initial values
  SW.max <- 5
  SW.min <- 0
  df$PAW <- NA
  df$aetc <- NA
  df$SW <- NA
  df$PAW[1] <- SW.ini + df$rain[1]
  df$aetc[1] <- ifelse(df$PAW[1] >= df$swc[1], df$PETc[1], (df$PAW[1]/df$swc[1])*df$PETc[1])
  df$aetc[1] <- ifelse(df$aetc[1] > df$PAW[1], df$PAW[1], df$aetc[1])
  df$SW[1] <- SW.ini + df$rain[1] - df$aetc[1]
  df$SW[1] <- ifelse(df$SW[1] > SW.max, SW.max, ifelse(df$SW[1] < 0, 0, df$SW[1]))
  for (day in 2:nrow(df)){
    df$PAW[day] <- df$SW[day - 1] + df$rain[day]
    df$aetc[day] <- ifelse(df$PAW[day] >= df$swc[day], df$PETc[day], (df$PAW[day]/df$swc[day]) * df$PETc[day])
    df$aetc[day] <- ifelse(df$aetc[day] > df$PAW[day], df$PAW[day], df$aetc[day])
    df$SW[day] <- df$SW[day - 1] + df$rain[day] - df$aetc[day]
    df$SW[day] <- ifelse(df$SW[day] > SW.max, SW.max, ifelse(df$SW[day] < 0, 0, df$SW[day]))
  }
  return(df)
}
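library(tictoc)
library(tidyverse)
tic()
# Grouped by year only, as timed below; use group_by(loc.id, year) if the
# recursion should also restart for each location in the dummy data
df %>%
  group_by(year) %>%
  do(my_fun(.)) ->
  out
toc()
#> 5.075 sec elapsed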