如何有效地將 cumprod 應用於 tidyverse 中的所有列

Question

我有一個很大的隨機矩陣（500 x 100,000 或 500 x 10,000），我想將 cumprod 應用於每一列。 在 500 x 10,000 上，我的代碼在 13 秒內運行。 在 500 x 100,000 上，它在 32 分鍾內運行。 在 tidyverse 中是否有比以下更有效或更好的縮放方式來執行此操作？

library(tidyverse)
library(mc2d)

n.rows = 500  #Number of times
n.cols = 100000  #Number of samples
outcomes = as_tibble(matrix(1 + ifelse(rbern(n.rows * n.cols, .5), .5, -.4), 
                            nrow = n.rows)) %>% 
  modify(cumprod)

Answer 1

使用tidyverse ， across選擇遍歷列

out2 <- df1 %>%
    mutate(across(everything(), cumprod))

-檢查 OP 的 output

out1 <- df1 %>%
         modify(cumprod)

identical(out1, out2)
#[1] TRUE

如果還是慢，可以使用data.table方法

library(data.table)
setDT(df1)[, (seq_along(df1)) := lapply(.SD, cumprod)]

基准

使用更大的數據，一個選項也是使用lapply

n.rows <- 500  #Number of time
n.cols <- 100e3  #Number of samples

raw_data  <- as_tibble(matrix(1 + ifelse(rbern(n.rows * n.cols, .5), .5, -.4), nrow = n.rows))


system.time({out = raw_data %>% map_df(cumprod)})
#   user  system elapsed 
#  1.819   0.125   1.943 
system.time({raw_data[] <- lapply(raw_data, cumprod)})
#   user  system elapsed 
#  0.432   0.088   0.520

數據

df1 <- as_tibble(matrix(1 + ifelse(rbern(n.rows * n.cols, .5), .5, -.4), 
                         nrow = n.rows))

Answer 2

如果您固定tibble數據的行數，您實際上可以使用Reduce來實現cumprod ，這似乎是一個更快的選擇

do.call(rbind, Reduce("*", asplit(df, 1), accumulate = TRUE))

基准測試（感謝@akrun 的解決方案）

> system.time({
+   out1 <- do.call(rbind, Reduce("*", asplit(df, 1), accumulate = TRUE))
+ })
   user  system elapsed 
   6.55    0.22    6.76

> system.time({
+   out2 <- setDT(df)[, lapply(.SD, cumprod)]
+ })
   user  system elapsed 
  44.93    0.02   45.01

> all.equal(out1, as.matrix(out2))
[1] TRUE

其中虛擬數據df給出為

set.seed(1)
nr <- 500
df <- as_tibble(matrix(rnorm(5e7), nr), nrow = nr)

Answer 3

我對一堆解決方案進行了基准測試。 @ThomaslsCoding 的解決方案是提供的解決方案中最快的，但我似乎找到了一個更簡單且同樣快速的解決方案——使用 map_df。

> library(tidyverse)
> library(mc2d)
> library(data.table)
> 
> n.rows = 500  #Number of time
> n.cols = 100e3  #Number of samples
> 
> raw_data = as_tibble(matrix(1 + ifelse(rbern(n.rows * n.cols, .5), .5, -.4), nrow = n.rows))
> 
> system.time({out = raw_data %>% modify(cumprod)})
    user   system  elapsed 
1184.269   21.915 1207.873 
> system.time({out = raw_data %>% mutate(across(everything(), cumprod))})
    user   system  elapsed 
3495.240   15.792 3516.903 
> system.time({out = setDT(raw_data)[, lapply(.SD, cumprod)]})
   user  system elapsed 
 48.245   0.280  48.745 
> system.time({out = do.call(rbind, Reduce("*", asplit(raw_data, 1), accumulate = TRUE))})
   user  system elapsed 
  2.099   0.127   2.230 
> system.time({out = raw_data %>% map_df(cumprod)})
   user  system elapsed 
  1.306   0.008   1.316

下面的新測試

> system.time({out = do.call(rbind, Reduce("*", asplit(raw_data, 1), accumulate = TRUE))})
   user  system elapsed 
  1.777   0.032   1.814 
> system.time({out = raw_data %>% map_df(cumprod)})
   user  system elapsed 
  1.393   0.012   1.410 
> system.time({out[] = lapply(raw_data, cumprod)})
   user  system elapsed 
  0.278   0.003   0.282 
> system.time({out = apply(raw_data, 2, cumprod)})
   user  system elapsed 
  1.126   0.012   1.142

lapply 對我來說也是最快的

如何有效地將 cumprod 應用於 tidyverse 中的所有列

問題描述

3 個解決方案

解決方案1
2 2020-12-25 17:28:30

基准

數據

解決方案2
2 2020-12-26 00:19:00

解決方案3
1 2020-12-26 04:39:05

如何有效地將 cumprod 應用於 tidyverse 中的所有列

問題描述

3 個解決方案

解決方案1 2 2020-12-25 17:28:30

基准

數據

解決方案2 2 2020-12-26 00:19:00

解決方案3 1 2020-12-26 04:39:05

解決方案1
2 2020-12-25 17:28:30

解決方案2
2 2020-12-26 00:19:00

解決方案3
1 2020-12-26 04:39:05