[英]Optimize the for loop in R
DUMMY DATA SET: (difference from my data set is item_code is string in my case) 虚拟数据集:(与我的数据集不同的是item_code是字符串)
in_cluster <- data.frame(item_code = c(1:500))
in_cluster$cluster <-
sample(5, size = nrow(in_cluster), replace = TRUE)
real_sales <- data.frame(item_code = numeric(0), sales = numeric(0))
real_sales <-
data.frame(
item_code = sample(500, size = 100000, replace = TRUE),
sales = sample(500, size = 100000, replace = TRUE)
)
mean_trajectory <- data.frame(sales = c(1:52))
mean_trajectory$sales <- sample(500, size = 52, replace = TRUE)
training_df <- data.frame(
LTF_t_minus_1 = numeric(0),
LTF_t = numeric(0),
LTF_t_plus_1 = numeric(0),
RS_t_minus_1 = numeric(0),
RS_t = numeric(0),
STF_t_plus_1 = numeric(0)
)
training_df[nrow(training_df) + 1, ] <-
c(0, 0, mean_trajectory$sales[[1]], 0, 0, 19) # week 0
week = 2
I have a simple function in R in which all I do is: 我在R中有一个简单的函数,其中我要做的是:
system.time({
for (r in 1:nrow(in_cluster)) {
item <- in_cluster[r,]
sale_row <-
dplyr::filter(real_sales, item_code == item$item_code)
if (nrow(sale_row) > 2) {
new_df <- data.frame(
LTF_t_minus_1 = mean_trajectory$sales[[week - 1]],
LTF_t = mean_trajectory$sales[[week]],
LTF_t_plus_1 = mean_trajectory$sales[[week + 1]],
RS_t_minus_1 = sale_row$sales[[week - 1]],
RS_t = sale_row$sales[[week]],
STF_t_plus_1 = sale_row$sales[[week + 1]]
)
training_df <-
bind_rows(training_df, new_df)
}
}
})
I am quite new to R and found this really weird looking at how small the data really is yet how long ( 421.59 seconds
to loop through 500 rows) it is taking to loop through the data frame. 我对R还是很
421.59 seconds
,发现在查看数据实际需要多少时间( 421.59 seconds
500行需要421.59 seconds
)来遍历数据帧时,这确实很奇怪。
EDIT_IMPORTANT: However for above given dummy data set all it took was 1.10 seconds
to get the output > could this be because of having string for item_code? EDIT_IMPORTANT:但是,对于上述给定的虚拟数据集,只花了
1.10 seconds
即可获得输出 >这可能是因为有item_code的字符串吗? does it take that much time to process a string item_code. 处理一个字符串item_code是否花费那么多时间? (I didn't use string for dummy data sets because I do not know how to have 500 unique strings for
item_code
in in_cluster
, and have the same strings in real_sales
as item_code
) (我没有使用假人数据集的字符串,因为我不知道如何有500个独特的字符串
item_code
在in_cluster
,并且具有相同的字符串real_sales
为item_code
)
I read through few other articles which suggested ways to optimize the R code and used bind_rows
instead of rbind
or using: 我通读了其他一些文章,这些文章建议了优化R代码的方法,并使用
bind_rows
而不是rbind
或使用:
training_df[nrow(training_df) + 1,] <-
c(mean_trajectory$sales[[week-1]], mean_trajectory$sales[[week]], mean_trajectory$sales[[week+1]], sale_row$sales[[week-1]], sale_row$sales[[week]], sale_row$sales[[week+1]])
using bind_rows seems to have improved the performance by 36 seconds when looping through 500 rows of data frame in_cluster
遍历500行数据帧
in_cluster
时,使用bind_rows似乎将性能提高了36秒
Is it possible to use lapply in this scenario? 在这种情况下可以使用lapply吗? I tried code below and got an error:
我在下面尝试了代码,但出现错误:
Error in filter_impl(.data, dots) : $ operator is invalid for atomic vectors
filter_impl(.data,点)中的错误:$运算符对原子向量无效
myfun <- function(item, sales, mean_trajectory, week) {
sale_row<- filter(sales, item_code == item$item_code)
data.frame(
LTF_t_minus_1 = mean_trajectory$sales[[week-1]],
LTF_t = mean_trajectory$sales[[week]],
LTF_t_plus_1 = mean_trajectory$sales[[week+1]],
RS_t_minus_1 = sale_row$sales[[week-1]],
RS_t = sale_row$sales[[week]],
STF_t_plus_1 = sale_row$sales[[week+1]])
}
system.time({
lapply(in_cluster, myfun, sales= sales, mean_trajectory = mean_trajectory) %>% bind_rows()
})
Help with lapply
would be appreciated, however my main target is to speed up the loop. 对于
lapply
帮助将不胜感激,但是我的主要目标是加快循环速度。
Ok, so there a lot of bad practices in your code. 好的,因此您的代码中存在许多不良做法。
training_df <- bind_rows(training_df, new_df)
keeps growing in each iteration while running a pretty expensive operation ( bind_rows
)) training_df <- bind_rows(training_df, new_df)
在每次迭代中都在不断增长,同时运行相当昂贵的操作( bind_rows
)) mean_trajectory$sales[[week-1]]
and al per row while mean_trajectory
has nothing to do with the loop? You could just assign it afterwards). mean_trajectory$sales[[week-1]]
和每行一次,而mean_trajectory
与循环无关吗?之后)。 I would suggest an alternative simple data.table
solution which will perform much better. 我建议使用另一种简单的
data.table
解决方案,它的性能会更好。 The idea is to first make a binary join between in_cluster
and real_sales
(and run all the operations while joining without creating extra data frames and then binding them). 这个想法是首先在
in_cluster
和real_sales
之间in_cluster
二进制连接(并在连接时运行所有操作,而不创建额外的数据帧,然后将它们绑定)。 Then, run all the mean_trajectory
related lines only once. 然后,只运行一次所有与
mean_trajectory
相关的行。 (I ignored the training_df[nrow(training_df) + 1, ] <- c(0, 0, mean_trajectory$sales[[1]], 0, 0, 19)
initialization as it's irrelevant here and you can just add it afterwards using and rbind
) (我忽略了
training_df[nrow(training_df) + 1, ] <- c(0, 0, mean_trajectory$sales[[1]], 0, 0, 19)
初始化,因为此处无关紧要,您可以在以后使用和rbind
)
library(data.table) #v1.10.4
## First step
res <-
setDT(real_sales)[setDT(in_cluster), # binary join
if(.N > 2) .(RS_t_minus_1 = sales[week - 1], # The stuff you want to do
RS_t = sales[week], # by condition
STF_t_plus_1 = sales[week + 1]),
on = "item_code", # The join key
by = .EACHI] # Do the operations per each join
## Second step (run the `mean_trajectory` only once)
res[, `:=`(LTF_t_minus_1 = mean_trajectory$sales[week - 1],
LTF_t = mean_trajectory$sales[week],
LTF_t_plus_1 = mean_trajectory$sales[week + 1])]
Some benchmarks: 一些基准:
### Creating your data sets
set.seed(123)
N <- 1e5
N2 <- 5e7
in_cluster <- data.frame(item_code = c(1:N))
real_sales <-
data.frame(
item_code = sample(N, size = N2, replace = TRUE),
sales = sample(N, size = N2, replace = TRUE)
)
mean_trajectory <- data.frame(sales = sample(N, size = 25, replace = TRUE))
training_df <- data.frame(
LTF_t_minus_1 = numeric(0),
LTF_t = numeric(0),
LTF_t_plus_1 = numeric(0),
RS_t_minus_1 = numeric(0),
RS_t = numeric(0),
STF_t_plus_1 = numeric(0)
)
week = 2
###############################
################# Your solution
system.time({
for (r in 1:nrow(in_cluster)) {
item <- in_cluster[r,, drop = FALSE]
sale_row <-
dplyr::filter(real_sales, item_code == item$item_code)
if (nrow(sale_row) > 2) {
new_df <- data.frame(
LTF_t_minus_1 = mean_trajectory$sales[[week - 1]],
LTF_t = mean_trajectory$sales[[week]],
LTF_t_plus_1 = mean_trajectory$sales[[week + 1]],
RS_t_minus_1 = sale_row$sales[[week - 1]],
RS_t = sale_row$sales[[week]],
STF_t_plus_1 = sale_row$sales[[week + 1]]
)
training_df <-
bind_rows(training_df, new_df)
}
}
})
### Ran forever- I've killed it after half an hour
######################
########## My solution
library(data.table)
system.time({
res <-
setDT(real_sales)[setDT(in_cluster),
if(.N > 2) .(RS_t_minus_1 = sales[week - 1],
RS_t = sales[week],
STF_t_plus_1 = sales[week + 1]),
on = "item_code",
by = .EACHI]
res[, `:=`(LTF_t_minus_1 = mean_trajectory$sales[week - 1],
LTF_t = mean_trajectory$sales[week],
LTF_t_plus_1 = mean_trajectory$sales[week + 1])]
})
# user system elapsed
# 2.42 0.05 2.47
So for 50MM rows the data.table
solution ran for about 2 secs, while your solution ran endlessly until I've killed it (after half an hour). 因此,对于
data.table
行, data.table
解决方案运行了大约2秒钟,而您的解决方案却无休止地运行,直到我将其杀死(半小时后)。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.