Using purrr::map2 or pmap to avoid for loops

Question

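I am desperately trying to avoid for loops when computing custom financial indicators (multiple stocks, 5,000 rows each). I am trying to use purrr::map2, which works well for arithmetic on existing vectors, but I need to reference the lagged (previous) value of the very vector I am trying to create. Without a reference to the previous value, purrr::map2 works fine: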

    some_function <- function(a, b) { (a * b) + ((1 - a) * b) }
    a <- c(0.019, 0.026, 0.012, 0.022)  # some indicator
    b <- c(15.5, 16.7, 14.8, 13.1)      # close price
    purrr::map2(a, b, some_function)

This just produces the original close prices:

15.5, 16.7, 14.8, 13.1

What I really want is to create a new vector (c) that looks back at itself (lagged) as part of the calculation. For the first row, c == b; otherwise:

 desired_function <- function(a, b, c) { (a * b) + ((1 - a) * lag(c)) }

So I created a vector c, pre-filled it, and tried:

    c <- c(15.5, 0, 0, 0)
    purrr::map2(a, b, c, desired_function)

Obviously, this returns all NULL values.
The values of c should be: 15.50, 15.53, 15.52, 15.47

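Referencing previous values is very common in indicators, and it keeps forcing me back to clunky, slow for loops. Any suggestions are greatly appreciated.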

Answer 1

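If computing one value of a vector requires another value of the same vector, the computation cannot be vectorized; you have to compute the values one after another.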
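The sequential dependency can also be written as a fold, where each step receives the previous result. The following is a minimal sketch, not part of the original answer: it uses base R's Reduce() with accumulate = TRUE, and the helper name step_fn is made up for illustration. It still evaluates one element at a time, so it is a loop in disguise rather than a vectorized solution.

```r
a <- c(0.019, 0.026, 0.012, 0.022)  # some indicator
b <- c(15.5, 16.7, 14.8, 13.1)      # close price

# step_fn is a hypothetical helper: given the previous value of c and the
# current index i, it returns the next value of c.
step_fn <- function(prev, i) a[i] * b[i] + (1 - a[i]) * prev

# Start from c[1] = b[1] and fold over the remaining indices,
# keeping every intermediate result.
c_fold <- Reduce(step_fn, seq_along(a)[-1], accumulate = TRUE, init = b[1])
round(c_fold, 5)
# [1] 15.50000 15.53120 15.52243 15.46913

# Optional: the same fold with purrr::accumulate2(), run only if purrr
# is installed. unlist() normalizes the result to a plain numeric vector.
if (requireNamespace("purrr", quietly = TRUE)) {
  c_purrr <- unlist(purrr::accumulate2(
    a[-1], b[-1],
    function(prev, a_i, b_i) a_i * b_i + (1 - a_i) * prev,
    .init = b[1]
  ))
  stopifnot(isTRUE(all.equal(c_purrr, c_fold)))
}
```

Whether this reads better than an explicit for loop is a matter of taste; it should not be expected to be faster.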

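For loops are not slow in themselves; what matters is how you use them. For example, retrieving one value at a time from a data frame, or inserting one value at a time into it, is a common practice that is very slow.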
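To illustrate the point about data frames, here is a small sketch under assumed names (the data frame df and both helper functions are hypothetical). Both functions compute the same recursion. The first reads and writes the data frame one cell at a time; the second pulls the columns out as plain vectors, loops over those, and assigns the result back once. No timings are shown because they depend on the machine and the R version.

```r
df <- data.frame(
  a = c(0.019, 0.026, 0.012, 0.022),  # some indicator
  b = c(15.5, 16.7, 14.8, 13.1)       # close price
)

# Slow pattern: touches the data frame on every iteration.
loop_on_data_frame <- function(df) {
  df$c <- NA_real_
  df$c[1] <- df$b[1]
  for (i in seq_len(nrow(df))[-1]) {
    df[i, "c"] <- df[i, "a"] * df[i, "b"] + (1 - df[i, "a"]) * df[i - 1, "c"]
  }
  df
}

# Faster pattern: loop over plain vectors, write to the data frame once.
loop_on_vectors <- function(df) {
  a <- df$a
  b <- df$b
  out <- numeric(length(a))  # preallocate the result
  out[1] <- b[1]
  for (i in seq_along(a)[-1]) {
    out[i] <- a[i] * b[i] + (1 - a[i]) * out[i - 1]
  }
  df$c <- out
  df
}

# Both variants must agree on the result.
stopifnot(isTRUE(all.equal(loop_on_data_frame(df), loop_on_vectors(df))))
```

On a few rows the difference is invisible; it is with thousands of rows per stock that the per-cell data frame access tends to dominate the run time.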

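The implementation of for loops in R has improved a lot over the past 10 years. They are said to have been less efficient in the past, and in old posts you will find many people complaining about them.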

Recommended reading:

https://www.r-bloggers.com/2018/06/why-loops-are-slow-in-r/

And these two old questions (well, their answers):

Speed up the loop operation in R

Why are loops slow in R?

A little experiment

Let's benchmark the simplest (dumbest?) for loop implementation of the function without the lag against purrr::map2(): c = a*b + (1-a) * b

In this benchmark with 10 million elements, the for loop is more than 15 times faster than purrr::map2().

    # functions ---------------------------------------------------------------
    desired_function <- function(a, b) { a*b + (1-a) * b }

    des_fnc_for <- function(a, b) {
      c <- numeric(length(a))
      c[1] <- b[1]
      for (i in seq_along(a)) c[i] <- a[i] * b[i] + (1 - a[i]) * b[i]
      return(c)
    }

    # verify ------------------------------------------------------------------
    a <- c(0.019, 0.026, 0.012, 0.022)  # some indicator
    b <- c(15.5, 16.7, 14.8, 13.1)      # close price

    unlist(purrr::map2(a, b, desired_function))
    # [1] 15.5 16.7 14.8 13.1

    des_fnc_for(a, b)
    # [1] 15.5 16.7 14.8 13.1

    # benchmark ---------------------------------------------------------------
    a <- runif(10000000, 0.01, 0.03)
    b <- runif(10000000, 13, 17)

    system.time( des_fnc_for(a, b) )
    #    user  system elapsed
    #   1.143   0.007   1.163

    system.time( purrr::map2(a, b, desired_function) )
    #    user  system elapsed
    #  18.570   0.627  19.761
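One aside on this benchmark: the no-lag formula has no dependency between elements, so it needs neither a loop nor map2(). R's arithmetic operators already work element-wise on whole vectors. The sketch below checks that a vectorized expression matches the loop; the function name des_fnc_vec is made up here, and the vector length is kept small so it runs quickly. This does not help with the lagged version, which still needs sequential evaluation.

```r
des_fnc_for <- function(a, b) {
  c <- numeric(length(a))
  for (i in seq_along(a)) c[i] <- a[i] * b[i] + (1 - a[i]) * b[i]
  c
}

# Hypothetical vectorized counterpart: one expression over whole vectors.
des_fnc_vec <- function(a, b) a * b + (1 - a) * b

set.seed(1)
a <- runif(1e5, 0.01, 0.03)
b <- runif(1e5, 13, 17)

# The two implementations must agree.
stopifnot(isTRUE(all.equal(des_fnc_for(a, b), des_fnc_vec(a, b))))

# Timings vary by machine, so none are quoted here.
system.time(des_fnc_for(a, b))
system.time(des_fnc_vec(a, b))
```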

Answer 2

Here are some solutions. The first one follows your idea of using stats::lag (written with the stats:: prefix, because dplyr masks lag when it is attached):

r <- numeric(4L)
for (i in 1:4) {
  r[i] <- c[i + 1] <- a[i]*b[i] + (1 - a[i])*stats::lag(c)[i]
}
r
# [1] 15.50000 15.53120 15.52243 15.46913
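A note on what stats::lag() does in this loop, as far as I understand its default method: on a plain numeric vector it only shifts the time-series attribute (tsp) and leaves the values where they are. So stats::lag(c)[i] is the same as c[i], and the look-back actually comes from writing each result into c[i + 1]. A quick check with a throwaway vector x (shift_by_one is a hypothetical helper, shown only for contrast):

```r
x <- c(15.5, 0, 0, 0)
lagged <- stats::lag(x)

# The values are unchanged; only a "tsp" attribute has been added.
stopifnot(all(as.numeric(lagged) == x))
stopifnot(!is.null(attr(lagged, "tsp")))

# For comparison, shifting the values themselves has to be done by hand
# in base R, for example by prepending NA and dropping the last element.
shift_by_one <- function(v) c(NA, v[-length(v)])
shift_by_one(x)
```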

and another one that uses a starting value which is updated in every iteration; it is about 20% faster.

r <- numeric(4L)
sval <- 15.5
for (i in 1:4) {
  r[i] <- sval <- a[i]*b[i] + (1 - a[i])*sval
}
r
# [1] 15.50000 15.53120 15.52243 15.46913
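Since the question mentions several stocks, the loop above could be wrapped in a function and applied per stock. This is only a sketch under assumptions: the function name lagged_indicator, the data frame prices, and its column names ticker, a and b are all invented for illustration, and the rows are assumed to be sorted by date within each ticker already.

```r
# Same recursion as the loop above, with the start value taken from b[1].
lagged_indicator <- function(a, b) {
  r <- numeric(length(a))  # preallocate the result
  sval <- b[1]             # first row: c == b
  for (i in seq_along(a)) {
    r[i] <- sval <- a[i] * b[i] + (1 - a[i]) * sval
  }
  r
}

# Hypothetical example data: two tickers with four rows each.
prices <- data.frame(
  ticker = rep(c("AAA", "BBB"), each = 4),
  a = rep(c(0.019, 0.026, 0.012, 0.022), times = 2),
  b = c(15.5, 16.7, 14.8, 13.1, 31.0, 33.4, 29.6, 26.2)
)

# Split by ticker, compute the indicator per piece, and reassemble
# the rows in their original order.
pieces <- split(prices, prices$ticker)
pieces <- lapply(pieces, function(d) {
  d$c <- lagged_indicator(d$a, d$b)
  d
})
prices <- unsplit(pieces, prices$ticker)
prices
```

The loop runs once per stock over plain vectors, so the data frame itself is only touched when the finished column is assigned.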

Data:

a <- c(0.019, 0.026, 0.012, 0.022)
b <- c(15.5, 16.7, 14.8, 13.1)
c <- c(15.5, 0, 0, 0)
