Speeding up a double loop

Question

I am having an issue with how long it takes to run a double for loop with an if statement in R. One data set (DF1) has about 3000000 rows and the other (DF2) has about 22. Examples of the two data frames are given below.

DF1
DateTime                 REG
2018-07-01 12:00:00      NHDG
2018-07-12 11:55:23      NSKR

DF2
StartDateTime           EndDateTime         Direction
2018-07-01 07:55:11    2018-07-01 12:01:56     W
2018-07-12 11:00:23    2018-07-12 11:45:00     E

I want to flag any row in DF1 whose DateTime is between StartDateTime and EndDateTime. Hence the output will be as follows:

DF1  
DateTime                 REG      Flag
2018-07-01 12:00:00      NHDG      1
2018-07-12 11:55:23      NSKR      0

The code I currently use is:

#Flag if in delay or not
DF1$Flag<-0

for (i in 1:nrow(DF1)){
  for (j in 1:nrow(DF2)){
    if ((DF1$DateTime[i] >= DF2$StartDateTime[j]) & (DF1$DateTime <= DF2$EndDateTime[j])){
      DF1$Flag[i]<-1
    } else {
      DF1$Flag[i]<-DF1$Flag
    }
  }
}

I am more than happy for this code to be taken out of the for loops if possible.

Answer 1

If I understand correctly, Flag in DF1 should be set to 1 if the DateTime falls within any interval from DF2, right? Then the following base R code would do the job:

DF1$Flag = sapply(DF1$DateTime, 
                  function(x) as.integer(sum(x >= DF2$StartDateTime & 
                                               x <= DF2$EndDateTime) > 0))
#              DateTime  REG Flag
# 1 2018-07-01 12:00:00 NHDG    1
# 2 2018-07-12 11:55:23 NSKR    0

The idea is to vectorize the comparison: for each DateTime in DF1 (sapply does the "looping"), you compare the value to all intervals (StartDateTime and EndDateTime) in DF2 and sum the results. If the sum is greater than 0, at least one row of DF2 has an interval containing that DateTime. as.integer then converts the logical result of sum(...) > 0 to 1 or 0.

And, if you want a faster solution, using dplyr:

library(dplyr)

# Cross join via a dummy key, flag each pair, then keep the best row per DateTime
df1 = full_join(mutate(DF1, foo=1), mutate(DF2, foo=1), by='foo') %>% 
  mutate(Flag = as.integer(DateTime >= StartDateTime & DateTime <= EndDateTime)) %>%
  group_by(DateTime) %>% slice(which.max(Flag)) %>%
  select(DateTime, REG, Flag)

Otherwise: there seems to be a problem with your second loop, over the rows of DF2 (the j loop). For each row of DF1, you compare the date to the start and end dates of every row of DF2 in turn, basically overwriting the resulting Flag value each time and keeping only the result of the comparison with the very last row of DF2...? In other words, i in DF1$Flag[i] <- ... does not change inside the j loop, so the same element is overwritten on every iteration.

So if you just want to compare against the overall range from the earliest start to the latest end in DF2, you can simply do:

DF1$Flag = as.integer((DF1$DateTime >= min(DF2$StartDateTime)) & (DF1$DateTime <= max(DF2$EndDateTime)))
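Coming back to the "any interval" reading, one further base R option is to loop only over the short table and let each comparison run over all of DF1 at once. This is a minimal sketch, not something from the original post: the toy data mirrors the question's example, the UTC time zone is an assumption made for reproducibility, and in_any is a hypothetical helper name.

# Toy data mirroring the question's example (time zone assumed to be UTC)
DF1 <- data.frame(
  DateTime = as.POSIXct(c("2018-07-01 12:00:00", "2018-07-12 11:55:23"), tz = "UTC"),
  REG = c("NHDG", "NSKR")
)
DF2 <- data.frame(
  StartDateTime = as.POSIXct(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), tz = "UTC"),
  EndDateTime = as.POSIXct(c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), tz = "UTC"),
  Direction = c("W", "E")
)

# Loop over the few rows of DF2; each comparison is vectorized over all of DF1
in_any <- logical(nrow(DF1))
for (j in seq_len(nrow(DF2))) {
  in_any <- in_any |
    (DF1$DateTime >= DF2$StartDateTime[j] & DF1$DateTime <= DF2$EndDateTime[j])
}
DF1$Flag <- as.integer(in_any)
print(DF1)

With about 22 intervals this needs only 22 vectorized passes over DF1 rather than one function call per row, so it should scale better than sapply over 3000000 values, though I have not benchmarked it here.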

Answer 2

What about this?

library(data.table)
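# Note: DF2[x, ...] pairs row x of DF1 with row x of DF2, so this relies on the
# rows lining up as in the example; rows beyond nrow(DF2) give NA.
DF1$flag <- as.numeric(sapply(seq(nrow(DF1)), function(x)
  DF1[x, "DateTime"] %between% c(min(DF2[x, "StartDateTime"]), max(DF2[x, "EndDateTime"]))))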
#              DateTime  REG flag
# 1 2018-07-01 12:00:00 NHDG    1
# 2 2018-07-12 11:55:23 NSKR    0

Data

> dput(DF1)
structure(list(DateTime = structure(1:2, .Label = c("2018-07-01 12:00:00", 
"2018-07-12 11:55:23"), class = "factor"), REG = structure(1:2, .Label = c("NHDG", 
"NSKR"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L))
> dput(DF2)
structure(list(StartDateTime = structure(1:2, .Label = c("2018-07-01 07:55:11", 
"2018-07-12 11:00:23"), class = "factor"), EndDateTime = structure(1:2, .Label = c("2018-07-01 12:01:56", 
"2018-07-12 11:45:00"), class = "factor"), Direction = structure(2:1, .Label = c("E", 
"W"), class = "factor")), class = "data.frame", row.names = c(NA, 
-2L)) 

DF1$DateTime <- as.POSIXct(DF1$DateTime)
DF2$StartDateTime <- as.POSIXct(DF2$StartDateTime)
DF2$EndDateTime <- as.POSIXct(DF2$EndDateTime)

Answer 3

You could also go for foverlaps:

library(data.table)

setDT(DF1)[, DateTime := as.POSIXct(DateTime)][, EndDateTime := DateTime]
setDT(DF2)[, `:=` (StartDateTime = as.POSIXct(StartDateTime), 
                   EndDateTime = as.POSIXct (EndDateTime))]

setkey(DF1, DateTime, EndDateTime)
setkey(DF2, StartDateTime, EndDateTime)

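# With which = TRUE and mult = "first", foverlaps returns the matching row index
# in DF2 (NA if none), so convert "has a match" to 1/0 instead of keeping the index
DF1[, Flag := as.integer(!is.na(foverlaps(DF1, DF2, type = "within", which = TRUE, mult = "first")))][
  , EndDateTime := NULL]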

This checks, for every date in DF1, whether it lies within any interval in DF2.

It'll also be fast, at least according to my tests. Benchmark against the sapply approach:

Unit: milliseconds
   expr         min           lq        mean      median           uq        max neval
     DT    4.752853     5.247319    18.38787     5.42855     6.950966   311.1944    25
 sapply 9413.337014 10598.926908 11206.14866 10892.91751 11746.901293 13568.7995    25

This is on a dataset with 10 000 rows in DF1 and 12 in DF2.

I only ran it once on 300 000 / 22 rows, and this is what I got:

Unit: seconds
   expr       min        lq      mean    median        uq       max neval
     DT  11.60865  11.60865  11.60865  11.60865  11.60865  11.60865     1
 sapply 674.05823 674.05823 674.05823 674.05823 674.05823 674.05823     1
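For comparison, the same "is it inside any interval" check can also be written as a data.table non-equi update join. This is only a sketch under assumptions: dt1 and dt2 are hypothetical names for toy copies of the question's data, the UTC time zone is assumed, and it was not part of the benchmark above.

library(data.table)

# Toy copies of the question's data (names dt1/dt2 and UTC are assumptions)
dt1 <- data.table(
  DateTime = as.POSIXct(c("2018-07-01 12:00:00", "2018-07-12 11:55:23"), tz = "UTC"),
  REG = c("NHDG", "NSKR")
)
dt2 <- data.table(
  StartDateTime = as.POSIXct(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), tz = "UTC"),
  EndDateTime = as.POSIXct(c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), tz = "UTC"),
  Direction = c("W", "E")
)

# Start with everything unflagged, then set Flag to 1 for every row of dt1
# that matches at least one interval in dt2
dt1[, Flag := 0L]
dt1[dt2, on = .(DateTime >= StartDateTime, DateTime <= EndDateTime), Flag := 1L]
print(dt1)

Unlike the foverlaps version, this needs no helper EndDateTime column in the first table and no keys, at the cost of requiring a data.table version that supports non-equi joins.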

Answer 4

One faster way would be to use crossing() from tidyr to cross df1 and df2, set the flag per row in the new data frame, then use aggregate() to reduce the rows back down. This method assumes that there are no duplicate entries in df1. If there are, they will be combined.

> df1
             DateTime  REG
1 2018-07-01 12:00:00 NHDG
2 2018-07-12 11:55:23 NSKR
> df2
        StartDateTime         EndDateTime Direction
1 2018-07-01 07:55:11 2018-07-01 12:01:56         W
2 2018-07-12 11:00:23 2018-07-12 11:45:00         E
> # Create a DF with rows for each combination of df1 rows with df2 rows
> tmp <- crossing(df1, df2)
> tmp
             DateTime  REG       StartDateTime         EndDateTime Direction
1 2018-07-01 12:00:00 NHDG 2018-07-01 07:55:11 2018-07-01 12:01:56         W
2 2018-07-01 12:00:00 NHDG 2018-07-12 11:00:23 2018-07-12 11:45:00         E
3 2018-07-12 11:55:23 NSKR 2018-07-01 07:55:11 2018-07-01 12:01:56         W
4 2018-07-12 11:55:23 NSKR 2018-07-12 11:00:23 2018-07-12 11:45:00         E
> # Create a new column for the flag
> tmp$flag <- tmp$DateTime >= tmp$StartDateTime & tmp$DateTime <= tmp$EndDateTime
> tmp
             DateTime  REG       StartDateTime         EndDateTime Direction  flag
1 2018-07-01 12:00:00 NHDG 2018-07-01 07:55:11 2018-07-01 12:01:56         W  TRUE
2 2018-07-01 12:00:00 NHDG 2018-07-12 11:00:23 2018-07-12 11:45:00         E FALSE
3 2018-07-12 11:55:23 NSKR 2018-07-01 07:55:11 2018-07-01 12:01:56         W FALSE
4 2018-07-12 11:55:23 NSKR 2018-07-12 11:00:23 2018-07-12 11:45:00         E FALSE
> # Drop the unwanted columns
> tmp <- tmp[,c("DateTime", "REG", "flag")]
> tmp
             DateTime  REG  flag
1 2018-07-01 12:00:00 NHDG  TRUE
2 2018-07-01 12:00:00 NHDG FALSE
3 2018-07-12 11:55:23 NSKR FALSE
4 2018-07-12 11:55:23 NSKR FALSE
> # Sum all flags for a given df1 date and limit total to 1
> df1 <- aggregate(flag ~ DateTime + REG, tmp, FUN = function(x) {min(1, sum(x))})
> df1
             DateTime  REG flag
1 2018-07-01 12:00:00 NHDG    1
2 2018-07-12 11:55:23 NSKR    0
>

Running with many more dates and comparing against your original for loop and the sapply() method above:

  Original for loop method: 6.282 sec elapsed
           sapply() method:  1.65 sec elapsed
crossing() and aggregate(): 0.385 sec elapsed

The full script is here:

#!/usr/bin/env Rscript                                                                                                                              

library(tictoc)
library(tidyr)

# Setup: generate a lot of dates for performance comparison                                                                                         

beg <- as.POSIXct("2018-07-01 12:00:00")
end <- as.POSIXct("2100-12-01 12:00:00")
dates <- seq(beg, end, 60*60*24)

#df1 <- data.frame(c("2018-07-01 12:00:00", "2018-07-12 11:55:23"), c("NHDG","NSKR"))                                                               
df1 <- data.frame(dates, rep(c("NHDG","NSKR"), length(dates)/2))
df2 <- data.frame(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), c("W","E"))
colnames(df1) <- c("DateTime", "REG")
colnames(df2) <- c("StartDateTime","EndDateTime","Direction")

df1$DateTime <- as.POSIXct(df1$DateTime, tz = "America/Los_Angeles")
df2$StartDateTime <- as.POSIXct(df2$StartDateTime, tz = "America/Los_Angeles")
df2$EndDateTime <- as.POSIXct(df2$EndDateTime, tz = "America/Los_Angeles")

# Original (fixed)                                                                                                                                  

tic(sprintf("%30s", "Original for loop method"))

for (i in 1:nrow(df1)){
  df1$flag[i] <- 0
  for (j in 1:nrow(df2)){
    if ((df1$DateTime[i] >= df2$StartDateTime[j]) & (df1$DateTime[i] <= df2$EndDateTime[j])){
      df1$flag[i]<-1
      break
    }
  }
}

toc()

result1 <- df1
df1$flag <- NULL

# Sapply                                                                                                                                            

tic(sprintf("%30s", "sapply() method"))

df1$flag = sapply(df1$DateTime,
                  function(x) as.integer(sum(x >= df2$StartDateTime &
                                             x <= df2$EndDateTime) > 0))
toc()

result2 <- df1
df1$flag <- NULL

# Aggregate                                                                                                                                         

tic(sprintf("%30s", "crossing() and aggregate()"))

# Create a DF with rows for each combination of df1 rows with df2 rows                                                                              
tmp <- crossing(df1, df2)
# Create a new column for the flag                                                                                                                  
tmp$flag <- tmp$DateTime >= tmp$StartDateTime & tmp$DateTime <= tmp$EndDateTime
# Drop the unwanted columns                                                                                                                         
tmp <- tmp[,c("DateTime", "REG", "flag")]
# Sum all flags for a given df1 date and limit total to 1                                                                                           
df1 <- aggregate(flag ~ DateTime + REG, tmp, FUN = function(x) {min(1, sum(x))})
# Sort the rows by date                                                                                                                             
df1 <- df1[order(df1$DateTime),]
# Reset the row names (for comparison below)                                                                                                        
rownames(df1) <- NULL

toc()

result3 <- df1

# Prove that results are the same                                                                                                                   

# all.equal() returns a character vector on mismatch, so wrap it in isTRUE()
if (!isTRUE(all.equal(result1, result2))) {
  stop("MISMATCH")
}

if (!isTRUE(all.equal(result1, result3))) {
  stop("MISMATCH")
}

print("PASS")
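
If duplicate rows in df1 need to be preserved (the caveat mentioned at the top of this answer), the same all-pairs comparison can be sketched in base R with outer(), which keeps one output element per input row. This is an illustrative alternative rather than the method timed above; the toy data, the duplicated first row, the UTC time zone, and the helper names t1, starts, ends and inside are assumptions.

# Toy data with a deliberately duplicated first row (time zone assumed to be UTC)
df1 <- data.frame(
  DateTime = as.POSIXct(c("2018-07-01 12:00:00", "2018-07-12 11:55:23",
                          "2018-07-01 12:00:00"), tz = "UTC"),
  REG = c("NHDG", "NSKR", "NHDG")
)
df2 <- data.frame(
  StartDateTime = as.POSIXct(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), tz = "UTC"),
  EndDateTime = as.POSIXct(c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), tz = "UTC"),
  Direction = c("W", "E")
)

# Work on plain numbers (seconds since the epoch) so outer() sees simple vectors
t1 <- as.numeric(df1$DateTime)
starts <- as.numeric(df2$StartDateTime)
ends <- as.numeric(df2$EndDateTime)

# inside[i, j] is TRUE when df1 row i falls within df2 interval j
inside <- outer(t1, starts, ">=") & outer(t1, ends, "<=")
df1$flag <- as.integer(rowSums(inside) > 0)
print(df1)

The logical matrix has nrow(df1) * nrow(df2) elements, so memory use grows the same way as with crossing(); the difference is that no aggregation step is needed and row order is left untouched.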
