Speed up a double for loop
I am having an issue with how long it takes to run a double for loop with an if statement in R. One data set (DF1) has about 3,000,000 rows and the other (DF2) has about 22. An example of the two data frames is given below.
DF1
DateTime REG
2018-07-01 12:00:00 NHDG
2018-07-12 11:55:23 NSKR
DF2
StartDateTime EndDateTime Direction
2018-07-01 07:55:11 2018-07-01 12:01:56 W
2018-07-12 11:00:23 2018-07-12 11:45:00 E
I want to flag any row in DF1 whose DateTime is between StartDateTime and EndDateTime. Hence the output will be as follows:
DF1
DateTime REG Flag
2018-07-01 12:00:00 NHDG 1
2018-07-12 11:55:23 NSKR 0
The code I currently use is:
#Flag if in delay or not
DF1$Flag<-0
for (i in 1:nrow(DF1)){
for (j in 1:nrow(DF2)){
if ((DF1$DateTime[i] >= DF2$StartDateTime[j]) & (DF1$DateTime <= DF2$EndDateTime[j])){
DF1$Flag[i]<-1
} else {
DF1$Flag[i]<-DF1$Flag
}
}
}
I am more than happy for this code to be taken out of the for loops if possible.
If I understand properly, the value of Flag in DF1 should be set to 1 if the DateTime falls within any interval from DF2, right? Then the following base R code would do the job:
DF1$Flag = sapply(DF1$DateTime,
function(x) as.integer(sum(x >= DF2$StartDateTime &
x <= DF2$EndDateTime) > 0))
# DateTime REG Flag
# 1 2018-07-01 12:00:00 NHDG 1
# 2 2018-07-12 11:55:23 NSKR 0
The idea is to vectorize the comparison: for each DateTime in DF1 (sort of "looping" through sapply), you compare the value to all intervals (StartDateTime and EndDateTime) from DF2 and sum the results. If the sum is greater than 0, at least one row in DF2 has an interval that contains that DateTime. Then as.integer converts the logical output of sum(...) > 0 to 1 or 0.
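To make these steps concrete, here is a small sketch that unpacks what the anonymous function does for a single timestamp, using the sample values from the question. The names starts, ends, x and inside are introduced here only for illustration, and the UTC time zone is an assumption.

```r
# Sample intervals from the question (time zone assumed to be UTC)
starts <- as.POSIXct(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), tz = "UTC")
ends   <- as.POSIXct(c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), tz = "UTC")

# One timestamp taken from DF1
x <- as.POSIXct("2018-07-01 12:00:00", tz = "UTC")

# Compare x against every interval at once: one logical value per row of DF2
inside <- x >= starts & x <= ends
print(inside)                       # TRUE FALSE

# sum() counts the matching intervals; > 0 means "at least one"
flag <- as.integer(sum(inside) > 0)
print(flag)                         # 1

# any() expresses the same test more directly
stopifnot(flag == as.integer(any(inside)))
```

sapply simply repeats this for every element of DF1$DateTime, which is why its cost still grows with the number of rows in DF1.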
And, if you want a faster solution, using dplyr:
library(dplyr)
# Cross join via a dummy key, flag each DF1/DF2 pair, then keep one row per DateTime
df1 = full_join(mutate(DF1, foo=1), mutate(DF2, foo=1), by='foo') %>%
mutate(Flag = as.integer(DateTime >= StartDateTime & DateTime <= EndDateTime)) %>%
group_by(DateTime) %>% slice(which.max(Flag)) %>%
select(DateTime, REG, Flag)
Otherwise: there seems to be a problem with your second loop, over the rows of DF2 (the j loop). For each row of DF1, you compare the date to the start and end dates of every row of DF2 in turn, basically overwriting the resulting Flag value each time and only keeping the result of the comparison with the very last row of DF2...? In other words, i in DF1$Flag[i] <- ... does not change inside the j loop, so the same element is overwritten on every iteration.
So if you just want to compare against the overall min and max date range from DF2, you can simply do:
DF1$Flag = as.integer((DF1$DateTime >= min(DF2$StartDateTime)) & (DF1$DateTime <= max(DF2$EndDateTime)))
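Note that this shortcut treats DF2 as one big range from the earliest start to the latest end, so a timestamp that falls in a gap between two intervals is still flagged. If the per-interval test is what you need, one option (a sketch, not part of the original answer) is to loop over the few rows of DF2 and vectorize across DF1. The extra row 2018-07-05 00:00:00 with REG "TEST" and the UTC time zone are made up for illustration.

```r
# Sample data from the question plus one hypothetical row that sits in a gap
DF1 <- data.frame(
  DateTime = as.POSIXct(c("2018-07-01 12:00:00",
                          "2018-07-05 00:00:00",   # hypothetical, between the two intervals
                          "2018-07-12 11:55:23"), tz = "UTC"),
  REG = c("NHDG", "TEST", "NSKR")
)
DF2 <- data.frame(
  StartDateTime = as.POSIXct(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), tz = "UTC"),
  EndDateTime   = as.POSIXct(c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), tz = "UTC")
)

# Min/max shortcut: a single range from the earliest start to the latest end
range_flag <- as.integer(DF1$DateTime >= min(DF2$StartDateTime) &
                         DF1$DateTime <= max(DF2$EndDateTime))

# Per-interval test: loop over the short table, vectorize over the long one
hit <- logical(nrow(DF1))
for (j in seq_len(nrow(DF2))) {
  hit <- hit | (DF1$DateTime >= DF2$StartDateTime[j] &
                DF1$DateTime <= DF2$EndDateTime[j])
}
interval_flag <- as.integer(hit)

print(range_flag)      # 1 1 0
print(interval_flag)   # 1 0 0
```

The loop runs only nrow(DF2) times (about 22 in the question), and each iteration is a vectorized comparison over all of DF1.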
What about this?
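library(data.table)
# Note: this pairs row x of DF1 with row x of DF2, so it assumes both
# data frames have the same number of rows (as in the sample data below)
DF1$flag <- as.numeric(sapply(seq(nrow(DF1)), function(x)
DF1[x, "DateTime"] %between% c(min(DF2[x, "StartDateTime"]), max(DF2[x, "EndDateTime"]))))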
# DateTime REG flag
# 1 2018-07-01 12:00:00 NHDG 1
# 2 2018-07-12 11:55:23 NSKR 0
Data
> dput(DF1)
structure(list(DateTime = structure(1:2, .Label = c("2018-07-01 12:00:00",
"2018-07-12 11:55:23"), class = "factor"), REG = structure(1:2, .Label = c("NHDG",
"NSKR"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
> dput(DF2)
structure(list(StartDateTime = structure(1:2, .Label = c("2018-07-01 07:55:11",
"2018-07-12 11:00:23"), class = "factor"), EndDateTime = structure(1:2, .Label = c("2018-07-01 12:01:56",
"2018-07-12 11:45:00"), class = "factor"), Direction = structure(2:1, .Label = c("E",
"W"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
DF1$DateTime <- as.POSIXct(DF1$DateTime)
DF2$StartDateTime <- as.POSIXct(DF2$StartDateTime)
DF2$EndDateTime <- as.POSIXct(DF2$EndDateTime)
You could also go for foverlaps:
library(data.table)
setDT(DF1)[, DateTime := as.POSIXct(DateTime)][, EndDateTime := DateTime]
setDT(DF2)[, `:=` (StartDateTime = as.POSIXct(StartDateTime),
EndDateTime = as.POSIXct(EndDateTime))]
setkey(DF1, DateTime, EndDateTime)
setkey(DF2, StartDateTime, EndDateTime)
# With which = TRUE and mult = "first", foverlaps returns the index of the matching
# DF2 row (NA if there is none), so turn that into a 0/1 flag
DF1[, Flag := as.integer(!is.na(foverlaps(DF1, DF2, type = "within", which = TRUE, mult = "first")))][
, EndDateTime := NULL]
This will check, for every date in DF1, whether it lies in any interval in DF2.
It'll also be fast, at least according to my tests. Benchmark against sapply:
Unit: milliseconds
expr min lq mean median uq max neval
DT 4.752853 5.247319 18.38787 5.42855 6.950966 311.1944 25
sapply 9413.337014 10598.926908 11206.14866 10892.91751 11746.901293 13568.7995 25
This is on a dataset with 10,000 rows in DF1 and 12 in DF2.
I only ran it once on 300,000 / 22 rows, and this is what I get:
Unit: seconds
expr min lq mean median uq max neval
DT 11.60865 11.60865 11.60865 11.60865 11.60865 11.60865 1
sapply 674.05823 674.05823 674.05823 674.05823 674.05823 674.05823 1
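If data.table is not available, a similar kind of speed-up can be sketched in base R with findInterval(), which uses a binary search instead of comparing every timestamp with every interval. This is a different technique from foverlaps and was not benchmarked above; the helper name flag_in_intervals is made up for this illustration, and the UTC time zone is an assumption.

```r
# Hypothetical helper: returns 1 if x[i] lies in at least one [start, end] interval
flag_in_intervals <- function(x, start, end) {
  x   <- as.numeric(x)
  ord <- order(as.numeric(start))
  s   <- as.numeric(start)[ord]
  # Running maximum of the end times, so overlapping intervals are handled too
  e   <- cummax(as.numeric(end)[ord])
  # Number of intervals whose start is <= x (0 if x precedes all of them)
  idx <- findInterval(x, s)
  # x is covered if the latest end among those intervals is not before x
  as.integer(idx > 0 & x <= e[pmax(idx, 1L)])
}

# Sample data from the question (time zone assumed to be UTC)
dt    <- as.POSIXct(c("2018-07-01 12:00:00", "2018-07-12 11:55:23"), tz = "UTC")
start <- as.POSIXct(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), tz = "UTC")
end   <- as.POSIXct(c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), tz = "UTC")

flags <- flag_in_intervals(dt, start, end)
print(flags)   # 1 0
```

Sorting the intervals by start time and taking the cumulative maximum of the end times means each lookup only needs to check one value, whatever the number of intervals.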
One faster way would be to use crossing() from tidyr to cross df1 and df2, set the flag per row in the new data frame, then use aggregate() to reduce the rows back down. This method assumes that there are no duplicate entries in df1. If there are, they will be combined.
> df1
DateTime REG
1 2018-07-01 12:00:00 NHDG
2 2018-07-12 11:55:23 NSKR
> df2
StartDateTime EndDateTime Direction
1 2018-07-01 07:55:11 2018-07-01 12:01:56 W
2 2018-07-12 11:00:23 2018-07-12 11:45:00 E
> # Create a DF with rows for each combination of df1 rows with df2 rows
> tmp <- crossing(df1, df2)
> tmp
DateTime REG StartDateTime EndDateTime Direction
1 2018-07-01 12:00:00 NHDG 2018-07-01 07:55:11 2018-07-01 12:01:56 W
2 2018-07-01 12:00:00 NHDG 2018-07-12 11:00:23 2018-07-12 11:45:00 E
3 2018-07-12 11:55:23 NSKR 2018-07-01 07:55:11 2018-07-01 12:01:56 W
4 2018-07-12 11:55:23 NSKR 2018-07-12 11:00:23 2018-07-12 11:45:00 E
> # Create a new column for the flag
> tmp$flag <- tmp$DateTime >= tmp$StartDateTime & tmp$DateTime <= tmp$EndDateTime
> tmp
DateTime REG StartDateTime EndDateTime Direction flag
1 2018-07-01 12:00:00 NHDG 2018-07-01 07:55:11 2018-07-01 12:01:56 W TRUE
2 2018-07-01 12:00:00 NHDG 2018-07-12 11:00:23 2018-07-12 11:45:00 E FALSE
3 2018-07-12 11:55:23 NSKR 2018-07-01 07:55:11 2018-07-01 12:01:56 W FALSE
4 2018-07-12 11:55:23 NSKR 2018-07-12 11:00:23 2018-07-12 11:45:00 E FALSE
> # Drop the unwanted columns
> tmp <- tmp[,c("DateTime", "REG", "flag")]
> tmp
DateTime REG flag
1 2018-07-01 12:00:00 NHDG TRUE
2 2018-07-01 12:00:00 NHDG FALSE
3 2018-07-12 11:55:23 NSKR FALSE
4 2018-07-12 11:55:23 NSKR FALSE
> # Sum all flags for a given df1 date and limit total to 1
> df1 <- aggregate(flag ~ DateTime + REG, tmp, FUN = function(x) {min(1, sum(x))})
> df1
DateTime REG flag
1 2018-07-01 12:00:00 NHDG 1
2 2018-07-12 11:55:23 NSKR 0
>
Running with many more dates and comparing against your original for loop and the sapply() method above:
Original for loop method: 6.282 sec elapsed
sapply() method: 1.65 sec elapsed
crossing() and aggregate(): 0.385 sec elapsed
The full script is here:
#!/usr/bin/env Rscript
library(tictoc)
library(tidyr)
# Setup: generate a lot of dates for performance comparison
beg <- as.POSIXct("2018-07-01 12:00:00")
end <- as.POSIXct("2100-12-01 12:00:00")
dates <- seq(beg, end, 60*60*24)
#df1 <- data.frame(c("2018-07-01 12:00:00", "2018-07-12 11:55:23"), c("NHDG","NSKR"))
df1 <- data.frame(dates, rep(c("NHDG","NSKR"), length(dates)/2))
df2 <- data.frame(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), c("W","E"))
colnames(df1) <- c("DateTime", "REG")
colnames(df2) <- c("StartDateTime","EndDateTime","Direction")
df1$DateTime <- as.POSIXct(df1$DateTime, tz = "America/Los_Angeles")
df2$StartDateTime <- as.POSIXct(df2$StartDateTime, tz = "America/Los_Angeles")
df2$EndDateTime <- as.POSIXct(df2$EndDateTime, tz = "America/Los_Angeles")
# Original (fixed)
tic(sprintf("%30s", "Original for loop method"))
for (i in 1:nrow(df1)){
df1$flag[i] <- 0
for (j in 1:nrow(df2)){
if ((df1$DateTime[i] >= df2$StartDateTime[j]) & (df1$DateTime[i] <= df2$EndDateTime[j])){
df1$flag[i]<-1
break
}
}
}
toc()
result1 <- df1
df1$flag <- NULL
# Sapply
tic(sprintf("%30s", "sapply() method"))
df1$flag = sapply(df1$DateTime,
function(x) as.integer(sum(x >= df2$StartDateTime &
x <= df2$EndDateTime) > 0))
toc()
result2 <- df1
df1$flag <- NULL
# Aggregate
tic(sprintf("%30s", "crossing() and aggregate()"))
# Create a DF with rows for each combination of df1 rows with df2 rows
tmp <- crossing(df1, df2)
# Create a new column for the flag
tmp$flag <- tmp$DateTime >= tmp$StartDateTime & tmp$DateTime <= tmp$EndDateTime
# Drop the unwanted columns
tmp <- tmp[,c("DateTime", "REG", "flag")]
# Sum all flags for a given df1 date and limit total to 1
df1 <- aggregate(flag ~ DateTime + REG, tmp, FUN = function(x) {min(1, sum(x))})
# Sort the rows by date
df1 <- df1[order(df1$DateTime),]
# Reset the row names (for comparison below)
rownames(df1) <- NULL
toc()
result3 <- df1
# Prove that results are the same
# all.equal() returns a character vector on mismatch, so wrap it in isTRUE()
if (!isTRUE(all.equal(result1, result2))) {
print("MISMATCH")
stop()
}
if (!isTRUE(all.equal(result1, result3))) {
print("MISMATCH")
stop()
}
print("PASS")