Vectorizing a for loop in R

I am trying to improve my code by taking advantage of R's vectorization, for example by using the apply family of functions instead of a for loop. The dataset I work with reaches 300K records, and I'd love to cut down the script's running time.

I have prepared a reprex as well as the actual for loop; I just don't know whether it can be transformed into a non-loop structure.

Here it goes:

df <- structure(list(time = structure(c(1500697800, 1500698100, 1500698400, 
                                        1500698700, 1500699000, 1500699300, 1500699600, 1500699900, 1500700200, 
                                        1500700500, 1500700800, 1500701100, 1500701400, 1500701700, 1500702000, 
                                        1500702300, 1500702600, 1500702900, 1500703200, 1500703500, 1500703800, 
                                        1500704100, 1500704400, 1500704700, 1500705000, 1500705300, 1500705600, 
                                        1500705900, 1500706200, 1500706500, 1500706800, 1500707100, 1500707400, 
                                        1500707700, 1500708000, 1500708300, 1500708600, 1500708900, 1500709200, 
                                        1500709500, 1500709800, 1500710100, 1500710400, 1500710700, 1500711000, 
                                        1500711300, 1500711600, 1500711900, 1500712200, 1500712500, 1500712800, 
                                        1500713100, 1500713400, 1500713700, 1500714000, 1500714300, 1500714600, 
                                        1500714900, 1500715200, 1500715500, 1500715800, 1500716100, 1500716400, 
                                        1500716700, 1500717000, 1500717300, 1500717600, 1500717900, 1500718200, 
                                        1500718500, 1500718800, 1500719100, 1500719400, 1500719700, 1500720000, 
                                        1500720300, 1500720600, 1500720900, 1500721200, 1500721500, 1500721800, 
                                        1500722100, 1500722400, 1500722700, 1500723000, 1500723300, 1500723600, 
                                        1500723900, 1500724200, 1500724500, 1500724800, 1500725100, 1500725400, 
                                        1500725700, 1500726000, 1500726300, 1500726600, 1500726900, 1500727200, 
                                        1500727500, 1500727800, 1500728100, 1500728400, 1500728700, 1500729000, 
                                        1500729300, 1500729600, 1500729900, 1500730200, 1500730500, 1500730800, 
                                        1500731100, 1500731400, 1500731700, 1500732000, 1500732300, 1500732600, 
                                        1500732900, 1500733200, 1500733500, 1500733800, 1500734100, 1500734400, 
                                        1500734700, 1500735000, 1500735300, 1500735600, 1500735900, 1500736200, 
                                        1500736500, 1500736800, 1500737100, 1500737400, 1500737700, 1500738000, 
                                        1500738300, 1500738600, 1500738900, 1500739200, 1500739500, 1500739800, 
                                        1500740100, 1500740400, 1500740700, 1500741000), class = c("POSIXct", 
                                                                                                   "POSIXt"), tzone = "UTC"), rate = c(8021.22624828867, 8022.17252092756, 
                                                                                                                                       4026.57093082574, 0, 0, 0, 0, 0, 0, 0, 0, 1092.48742657481, 0, 
                                                                                                                                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                                                                                                                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                                                                                                                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                                                                                                                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                                                                                                                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
                                                                                                                                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2352.47712160156, 0, 0, 0, 0, 0, 
                                                                                                                                       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0), is.rate = c("OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", 
                                                                                                                                                                                     "OFF", "OFF", "OFF", "OFF", "OFF", "OFF", "OFF")), class = c("tbl_df", 
                                                                                                                                                                                                                                                  "tbl", "data.frame"), row.names = c(NA, -145L))


To quickly explain the data: it has a time variable, a rate, and a flag (is.rate) that should be "ON" when rate is not 0.

The idea of the for loop is that it picks up rate values above 0 and, in terms of time, "tails" the is.rate flag onwards for the next hour. I know it sounds complicated, but once you run the for loop on the reprex, it should make sense.

As for the for loop, here it is:

temp_df <- df # work on a copy of the reprex data

for (i in which(temp_df$rate != 0)) {
  # 12 is the lag factor: the data is in 5-minute intervals, so 12 rows cover the next hour
  temp_df$is.rate[i:(i + 12)] <- "ON"
}

I'd love to optimize this code, preferably removing the for loop entirely in favor of something like an apply family function, but I can't quite see how to structure it.
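To make the loop's behavior concrete, here is a minimal sketch on hypothetical toy data (the `toy` data frame and `lag_rows` are illustrative names, not part of the original post). It also clamps the upper index with `min()`, on the assumption that a non-zero rate may occur within the last few rows, in which case `i + lag_rows` would run past the end of the data.

```r
# Hypothetical toy data; only the row order matters here
toy <- data.frame(
  rate    = c(0, 9, 0, 0, 0, 0, 2),
  is.rate = "OFF",
  stringsAsFactors = FALSE
)
lag_rows <- 3  # stand-in for the 12 used in the question

n <- nrow(toy)
for (i in which(toy$rate != 0)) {
  # flag the hit itself plus the next lag_rows rows, never past the last row
  toy$is.rate[i:min(i + lag_rows, n)] <- "ON"
}
toy$is.rate
# "OFF" "ON" "ON" "ON" "ON" "OFF" "ON"
```

The hit in row 2 flags rows 2 to 5, and the hit in row 7 flags only itself because the window is cut off at the last row.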

I think you are looking for "ON" to be set when rate > 0, lagged over the next 11 rows.

My comment above failed to include align = "right", which is necessary to get what I think is the logic you want. Try this:

zoo::rollapply(df$rate > 0, 12, any, align = "right", partial = TRUE)
#   [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
#  [13]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
#  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
# [133]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
# [145] FALSE
ifelse(zoo::rollapply(df$rate > 0, 12, any, align = "right", partial = TRUE), "YES", "NO")
#   [1] "YES" "YES" "YES" "YES" "YES" "YES" "YES" "YES" "YES" "YES" "YES" "YES"
#  [13] "YES" "YES" "YES" "YES" "YES" "YES" "YES" "YES" "YES" "YES" "YES" "NO" 

resulting in this data:

print(df, n=26)
# # A tibble: 145 x 3
#    time                 rate is.rate
#    <dttm>              <dbl> <chr>  
#  1 2017-07-22 04:30:00 8021. YES    
#  2 2017-07-22 04:35:00 8022. YES    
#  3 2017-07-22 04:40:00 4027. YES    
#  4 2017-07-22 04:45:00    0  YES    
#  5 2017-07-22 04:50:00    0  YES    
#  6 2017-07-22 04:55:00    0  YES    
#  7 2017-07-22 05:00:00    0  YES    
#  8 2017-07-22 05:05:00    0  YES    
#  9 2017-07-22 05:10:00    0  YES    
# 10 2017-07-22 05:15:00    0  YES    
# 11 2017-07-22 05:20:00    0  YES    ### counting rows from last non-zero rate
# 12 2017-07-22 05:25:00 1092. YES    1
# 13 2017-07-22 05:30:00    0  YES    2
# 14 2017-07-22 05:35:00    0  YES    3
# 15 2017-07-22 05:40:00    0  YES    4
# 16 2017-07-22 05:45:00    0  YES    5
# 17 2017-07-22 05:50:00    0  YES    6
# 18 2017-07-22 05:55:00    0  YES    7
# 19 2017-07-22 06:00:00    0  YES    8
# 20 2017-07-22 06:05:00    0  YES    9
# 21 2017-07-22 06:10:00    0  YES    10
# 22 2017-07-22 06:15:00    0  YES    11
# 23 2017-07-22 06:20:00    0  YES    12
# 24 2017-07-22 06:25:00    0  NO     
# 25 2017-07-22 06:30:00    0  NO     
# 26 2017-07-22 06:35:00    0  NO     
# # ... with 119 more rows
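For readers who prefer to stay in base R, the same trailing-window idea can be sketched with cumsum(). This is an illustrative sketch on hypothetical toy data, not part of the original answer; `rate`, `lag_rows` and the other names are assumptions. Note that the loop in the question writes i:(i + 12), which covers 13 rows, so a trailing window of width lag_rows + 1 is what reproduces the loop, whereas a width of 12 flags the hit plus the next 11 rows.

```r
# Hypothetical toy data: non-zero rate at positions 2 and 9
rate <- c(0, 5, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0)
lag_rows <- 3  # stand-in for the 12 used in the question
n <- length(rate)

# Reference result from the loop (upper index clamped to n)
loop_flag <- rep("OFF", n)
for (i in which(rate != 0)) loop_flag[i:min(i + lag_rows, n)] <- "ON"

# Vectorized: count hits in the trailing window (i - lag_rows):i
hit  <- as.integer(rate != 0)
cs   <- cumsum(hit)
# cs shifted down by lag_rows + 1 positions, padded with 0 at the start
prev <- c(rep(0L, lag_rows + 1), cs)[seq_len(n)]
vec_flag <- ifelse(cs - prev > 0, "ON", "OFF")

identical(loop_flag, vec_flag)
```

Because cumsum() and the subtraction each make a single pass over the vector, the cost does not grow with the window width, which may matter at 300K rows.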

I think what you need to do is find the indices where rate != 0, create a sequence from each of those indices to inds + 12, and set is.rate to "ON" at those positions.

inds <- which(temp_df$rate != 0)
temp_df$is.rate[unique(c(mapply(`:`, inds, inds + 12)))] <- "ON"

It gives the same output as the for loop.
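A variation on the index approach, sketched on hypothetical toy data (`rate`, `lag_rows` and `flag` are illustrative names, not from the original answer): the offsets can be built with rep() instead of mapply(), and indices past the last row can be dropped before assigning. The filtering step is an assumption about what is wanted when a non-zero rate sits near the end of the data.

```r
# Hypothetical toy data: non-zero rate at positions 3 and 8
rate <- c(0, 0, 7, 0, 0, 0, 0, 4)
lag_rows <- 2  # stand-in for the 12 used in the answer
n <- length(rate)

inds <- which(rate != 0)
# every hit index repeated lag_rows + 1 times, plus the recycled offsets 0..lag_rows
idx <- rep(inds, each = lag_rows + 1) + 0:lag_rows
# drop positions beyond the data and remove overlaps
idx <- unique(idx[idx <= n])

flag <- rep("OFF", n)
flag[idx] <- "ON"
flag
# "OFF" "OFF" "ON" "ON" "ON" "OFF" "OFF" "ON"
```

With the real data, the last two lines would correspond to assigning into temp_df$is.rate.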
