优化R中的for循环

Question

DUMMY DATA SET: (difference from my data set is item_code is string in my case) 虚拟数据集：（与我的数据集不同的是item_code是字符串）

in_cluster <- data.frame(item_code = c(1:500))
in_cluster$cluster <-
        sample(5, size = nrow(in_cluster), replace = TRUE)
real_sales <- data.frame(item_code = numeric(0), sales = numeric(0))
real_sales <-
    data.frame(
            item_code = sample(500, size = 100000, replace = TRUE),
            sales = sample(500, size = 100000, replace = TRUE)
    )

mean_trajectory <- data.frame(sales = c(1:52))
mean_trajectory$sales <- sample(500, size = 52, replace = TRUE)
training_df <- data.frame(
        LTF_t_minus_1 = numeric(0),
        LTF_t = numeric(0),
        LTF_t_plus_1 = numeric(0),
        RS_t_minus_1 = numeric(0),
        RS_t = numeric(0),
        STF_t_plus_1 = numeric(0)
)
training_df[nrow(training_df) + 1, ] <-
        c(0, 0, mean_trajectory$sales[[1]], 0, 0, 19) # week 0

week = 2

I have a simple function in R in which all I do is: 我在R中有一个简单的函数，其中我要做的是：

system.time({
    for (r in 1:nrow(in_cluster)) {
            item <- in_cluster[r,]
            sale_row <-
                    dplyr::filter(real_sales, item_code == item$item_code)
            if (nrow(sale_row) > 2) {
                    new_df <- data.frame(
                            LTF_t_minus_1 = mean_trajectory$sales[[week - 1]],
                            LTF_t = mean_trajectory$sales[[week]],
                            LTF_t_plus_1 = mean_trajectory$sales[[week + 1]],
                            RS_t_minus_1 = sale_row$sales[[week - 1]],
                            RS_t = sale_row$sales[[week]],
                            STF_t_plus_1 = sale_row$sales[[week + 1]]
                    )
                    training_df <-
                            bind_rows(training_df, new_df)
            }
    }
})

I am quite new to R and found this really weird looking at how small the data really is yet how long ( 421.59 seconds to loop through 500 rows) it is taking to loop through the data frame. 我对R还是很421.59 seconds ，发现在查看数据实际需要多少时间（ 421.59 seconds 500行需要421.59 seconds ）来遍历数据帧时，这确实很奇怪。

EDIT_IMPORTANT: However for above given dummy data set all it took was 1.10 seconds to get the output > could this be because of having string for item_code? EDIT_IMPORTANT：但是，对于上述给定的虚拟数据集，只花了1.10 seconds即可获得输出 >这可能是因为有item_code的字符串吗？ does it take that much time to process a string item_code. 处理一个字符串item_code是否花费那么多时间？ (I didn't use string for dummy data sets because I do not know how to have 500 unique strings for item_code in in_cluster , and have the same strings in real_sales as item_code ) （我没有使用假人数据集的字符串，因为我不知道如何有500个独特的字符串item_code在in_cluster ，并且具有相同的字符串real_sales为item_code ）

I read through few other articles which suggested ways to optimize the R code and used bind_rows instead of rbind or using: 我通读了其他一些文章，这些文章建议了优化R代码的方法，并使用bind_rows而不是rbind或使用：

training_df[nrow(training_df) + 1,] <-
    c(mean_trajectory$sales[[week-1]], mean_trajectory$sales[[week]], mean_trajectory$sales[[week+1]], sale_row$sales[[week-1]], sale_row$sales[[week]], sale_row$sales[[week+1]])

using bind_rows seems to have improved the performance by 36 seconds when looping through 500 rows of data frame in_cluster 遍历500行数据帧in_cluster时，使用bind_rows似乎将性能提高了36秒

Is it possible to use lapply in this scenario? 在这种情况下可以使用lapply吗？ I tried code below and got an error: 我在下面尝试了代码，但出现错误：

Error in filter_impl(.data, dots) : $ operator is invalid for atomic vectors filter_impl（.data，点）中的错误：$运算符对原子向量无效

myfun <- function(item, sales, mean_trajectory, week) {
sale_row<- filter(sales, item_code == item$item_code)
data.frame(
  LTF_t_minus_1 = mean_trajectory$sales[[week-1]],
  LTF_t = mean_trajectory$sales[[week]],
  LTF_t_plus_1 = mean_trajectory$sales[[week+1]],
  RS_t_minus_1 = sale_row$sales[[week-1]],
  RS_t = sale_row$sales[[week]],
  STF_t_plus_1 = sale_row$sales[[week+1]])  
}

system.time({
      lapply(in_cluster, myfun, sales= sales, mean_trajectory = mean_trajectory) %>% bind_rows()
})

Help with lapply would be appreciated, however my main target is to speed up the loop. 对于lapply帮助将不胜感激，但是我的主要目标是加快循环速度。

Answer 1

Ok, so there a lot of bad practices in your code. 好的，因此您的代码中存在许多不良做法。

You are operating per row 您正在按行进行操作
You are creating 2(!) new data frames per row (very expensive) 您正在每行创建2（！）个新数据帧（非常昂贵）
You are growing objects in a loop )that training_df <- bind_rows(training_df, new_df) keeps growing in each iteration while running a pretty expensive operation ( bind_rows )) 您正在循环生长对象）， training_df <- bind_rows(training_df, new_df)在每次迭代中都在不断增长，同时运行相当昂贵的操作（ bind_rows ））
You are running the same operation over and over again when you could just run them once (why are you running mean_trajectory$sales[[week-1]] and al per row while mean_trajectory has nothing to do with the loop? You could just assign it afterwards). 您一次又一次地运行相同的操作（为什么要分别运行mean_trajectory$sales[[week-1]]和每行一次，而mean_trajectory与循环无关吗？之后）。
And the list goes on... 而这样的例子不胜枚举...

I would suggest an alternative simple data.table solution which will perform much better. 我建议使用另一种简单的data.table解决方案，它的性能会更好。 The idea is to first make a binary join between in_cluster and real_sales (and run all the operations while joining without creating extra data frames and then binding them). 这个想法是首先在in_cluster和real_sales之间in_cluster二进制连接（并在连接时运行所有操作，而不创建额外的数据帧，然后将它们绑定）。 Then, run all the mean_trajectory related lines only once. 然后，只运行一次所有与mean_trajectory相关的行。 (I ignored the training_df[nrow(training_df) + 1, ] <- c(0, 0, mean_trajectory$sales[[1]], 0, 0, 19) initialization as it's irrelevant here and you can just add it afterwards using and rbind ) （我忽略了training_df[nrow(training_df) + 1, ] <- c(0, 0, mean_trajectory$sales[[1]], 0, 0, 19)初始化，因为此处无关紧要，您可以在以后使用和rbind ）

library(data.table) #v1.10.4
## First step
res <-
  setDT(real_sales)[setDT(in_cluster), # binary join
                  if(.N > 2) .(RS_t_minus_1 = sales[week - 1], # The stuff you want to do
                               RS_t = sales[week],             # by condition
                               STF_t_plus_1 = sales[week + 1]), 
                  on = "item_code", # The join key
                  by = .EACHI] # Do the operations per each join

## Second step (run the `mean_trajectory` only once)
res[, `:=`(LTF_t_minus_1 = mean_trajectory$sales[week - 1],
           LTF_t = mean_trajectory$sales[week],
           LTF_t_plus_1 = mean_trajectory$sales[week + 1])]

Some benchmarks: 一些基准：

### Creating your data sets
set.seed(123)
N <- 1e5
N2 <- 5e7

in_cluster <- data.frame(item_code = c(1:N))

real_sales <-
  data.frame(
    item_code = sample(N, size = N2, replace = TRUE),
    sales = sample(N, size = N2, replace = TRUE)
  )

mean_trajectory <- data.frame(sales = sample(N, size = 25, replace = TRUE))

training_df <- data.frame(
  LTF_t_minus_1 = numeric(0),
  LTF_t = numeric(0),
  LTF_t_plus_1 = numeric(0),
  RS_t_minus_1 = numeric(0),
  RS_t = numeric(0),
  STF_t_plus_1 = numeric(0)
)
week = 2

###############################
################# Your solution
system.time({
  for (r in 1:nrow(in_cluster)) {
    item <- in_cluster[r,, drop = FALSE]
    sale_row <-
      dplyr::filter(real_sales, item_code == item$item_code)
    if (nrow(sale_row) > 2) {
      new_df <- data.frame(
        LTF_t_minus_1 = mean_trajectory$sales[[week - 1]],
        LTF_t = mean_trajectory$sales[[week]],
        LTF_t_plus_1 = mean_trajectory$sales[[week + 1]],
        RS_t_minus_1 = sale_row$sales[[week - 1]],
        RS_t = sale_row$sales[[week]],
        STF_t_plus_1 = sale_row$sales[[week + 1]]
      )
      training_df <-
        bind_rows(training_df, new_df)
    }
  }
}) 
### Ran forever- I've killed it after half an hour


######################
########## My solution
library(data.table)
system.time({
res <-
  setDT(real_sales)[setDT(in_cluster), 
                  if(.N > 2) .(RS_t_minus_1 = sales[week - 1],
                               RS_t = sales[week],
                               STF_t_plus_1 = sales[week + 1]), 
                  on = "item_code",
                  by = .EACHI]
res[, `:=`(LTF_t_minus_1 = mean_trajectory$sales[week - 1],
           LTF_t = mean_trajectory$sales[week],
           LTF_t_plus_1 = mean_trajectory$sales[week + 1])]
})

# user  system elapsed 
# 2.42    0.05    2.47

So for 50MM rows the data.table solution ran for about 2 secs, while your solution ran endlessly until I've killed it (after half an hour). 因此，对于data.table行， data.table解决方案运行了大约2秒钟，而您的解决方案却无休止地运行，直到我将其杀死（半小时后）。

优化R中的for循环

问题描述

1 个解决方案

解决方案1
5 已采纳 2017-04-23 20:20:15

优化R中的for循环

问题描述

1 个解决方案

解决方案1 5 已采纳 2017-04-23 20:20:15

解决方案1
5 已采纳 2017-04-23 20:20:15