[英]Speed up a double for loop
我在R中使用if語句運行double for循環所花的時間長短有一個問題。在一個數據集中,我大約有3000000行(DF1),而在另一個數據集中我大約有22行(DF2)。 下面是我擁有的兩個數據幀的示例。
DF1
DateTime REG
2018-07-01 12:00:00 NHDG
2018-07-12 11:55:23 NSKR
DF2
StartDateTime EndDateTime Direction
2018-07-01 07:55:11 2018-07-01 12:01:56 W
2018-07-12 11:00:23 2018-07-12 11:45:00 E
當DateTime在StartDateTime和EndDateTime之間時,我想標記DF1中的任何內容。 因此,輸出將如下所示:
DF1
DateTime REG Flag
2018-07-01 12:00:00 NHDG 1
2018-07-12 11:55:23 NSKR 0
我目前使用的代碼是:
#Flag if in delay or not
DF1$Flag<-0
for (i in 1:nrow(DF1)){
for (j in 1:nrow(DF2)){
if ((DF1$DateTime[i] >= DF2$StartDateTime[j]) & (DF1$DateTime <= DF2$EndDateTime[j])){
DF1$Flag[i]<-1
} else {
DF1$Flag[i]<-DF1$Flag
}
}
}
如果可能,我很高興將此代碼從for循環中刪除。
如果我正確理解的話,如果DateTime在DF2
任何間隔之間,那么DF1
的Flag的值應該設置為1,對嗎? 然后,以下基本代碼將完成這項工作:
DF1$Flag = sapply(DF1$DateTime,
function(x) as.integer(sum(x >= DF2$StartDateTime &
x <= DF2$EndDateTime) > 0))
# DateTime REG Flag
# 1 2018-07-01 12:00:00 NHDG 1
# 2 2018-07-12 11:55:23 NSKR 0
想法是對比較進行矢量化處理:對於DF1
每個DateTime(通過sapply
進行“循環” sapply
),您將值與DF2
所有間隔(Start-和EndDateTime)進行比較,並對結果sum
:如果sum
大於0,則DF2
中至少有一行,其中DF1
中的DateTime介於其Start-和EndDateTime之間。 然后as.integer
將sum(...) > 0
的布爾輸出轉換為1
或0
。
而且,如果您想要更快的解決方案,請使用dplyr
:
df1 = full_join(mutate(DF1, foo=1), mutate(DF2, foo=1), by='foo') %>%
mutate(Flag = as.integer(DateTime >= StartDateTime & DateTime <= EndDateTime)) %>%
group_by(DateTime) %>% slice(which.max(Flag)) %>%
select(DateTime, REG, Flag)
否則:在DF2
的行(j循環)上,您的第二個循環似乎有問題:對於DF1
每一行,您將日期與DF2
所有行的開始日期和結束日期進行比較,基本上每次都覆蓋得到的標志值,僅保留與DF2
最后一行進行比較的結果...? 換句話說, i
在DF1$Flag[i] <- ...
不會在內部移動j
環路(並且是每個重寫時間)。
因此,如果您只想比較DF2
的最小和最大日期范圍,則可以執行以下操作:
DF1$Flag = as.integer((DF1$DateTime >= min(DF2$StartDateTime)) & (DF1$DateTime <= max(DF2$EndDateTime)))
那這個呢?
library(data.table)
DF1$flag <- as.numeric(sapply(seq(nrow(DF1)), function(x)
DF1[x, "DateTime"] %between% c(min(DF2[x, "StartDateTime"]), max(DF2[x, "EndDateTime"]))))
# DateTime REG flag
# 1 2018-07-01 12:00:00 NHDG 1
# 2 2018-07-12 11:55:23 NSKR 0
數據
> dput(DF1)
structure(list(DateTime = structure(1:2, .Label = c("2018-07-01 12:00:00",
"2018-07-12 11:55:23"), class = "factor"), REG = structure(1:2, .Label = c("NHDG",
"NSKR"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
> dput(DF2)
structure(list(StartDateTime = structure(1:2, .Label = c("2018-07-01 07:55:11",
"2018-07-12 11:00:23"), class = "factor"), EndDateTime = structure(1:2, .Label = c("2018-07-01 12:01:56",
"2018-07-12 11:45:00"), class = "factor"), Direction = structure(2:1, .Label = c("E",
"W"), class = "factor")), class = "data.frame", row.names = c(NA,
-2L))
DF1$DateTime <- as.POSIXct(DF1$DateTime)
DF2$StartDateTime <- as.POSIXct(DF2$StartDateTime)
DF2$EndDateTime <- as.POSIXct(DF2$EndDateTime)
也可能會foverlaps
:
library(data.table)
setDT(DF1)[, DateTime := as.POSIXct(DateTime)][, EndDateTime := DateTime]
setDT(DF2)[, `:=` (StartDateTime = as.POSIXct(StartDateTime),
EndDateTime = as.POSIXct (EndDateTime))]
setkey(DF1, DateTime, EndDateTime)
setkey(DF2, StartDateTime, EndDateTime)
DF1[, Flag := foverlaps(DF1, DF2, type = "within", which = TRUE, mult = "first")][
is.na(Flag), Flag := 0][, EndDateTime := NULL]
這將檢查DF1
每個日期是否位於DF2
中的任何間隔中。
至少根據我的測試,它也會很快。 sapply
基准測試:
Unit: milliseconds
expr min lq mean median uq max neval
DT 4.752853 5.247319 18.38787 5.42855 6.950966 311.1944 25
sapply 9413.337014 10598.926908 11206.14866 10892.91751 11746.901293 13568.7995 25
該數據集在DF1
具有10000行,在DF2
12行。
我只在300 000/22行上運行了一次,這就是我得到的:
Unit: seconds
expr min lq mean median uq max neval
DT 11.60865 11.60865 11.60865 11.60865 11.60865 11.60865 1
sapply 674.05823 674.05823 674.05823 674.05823 674.05823 674.05823 1
一種更快的方法是使用從tidyr到crossing df1和df2的crossing(),在新數據幀中為每行設置標志,然后使用aggregate()減少返回的行數。 此方法假定df1中沒有重復的條目。 如果有,它們將被合並。
> df1
DateTime REG
1 2018-07-01 12:00:00 NHDG
2 2018-07-12 11:55:23 NSKR
> df2
StartDateTime EndDateTime Direction
1 2018-07-01 07:55:11 2018-07-01 12:01:56 W
2 2018-07-12 11:00:23 2018-07-12 11:45:00 E
> # Create a DF with rows for each combination of df1 rows with df2 rows
> tmp <- crossing(df1, df2)
> tmp
DateTime REG StartDateTime EndDateTime Direction
1 2018-07-01 12:00:00 NHDG 2018-07-01 07:55:11 2018-07-01 12:01:56 W
2 2018-07-01 12:00:00 NHDG 2018-07-12 11:00:23 2018-07-12 11:45:00 E
3 2018-07-12 11:55:23 NSKR 2018-07-01 07:55:11 2018-07-01 12:01:56 W
4 2018-07-12 11:55:23 NSKR 2018-07-12 11:00:23 2018-07-12 11:45:00 E
> # Create a new column for the flag
> tmp$flag <- tmp$DateTime >= tmp$StartDateTime & tmp$DateTime <= tmp$EndDateTime
> tmp
DateTime REG StartDateTime EndDateTime Direction flag
1 2018-07-01 12:00:00 NHDG 2018-07-01 07:55:11 2018-07-01 12:01:56 W TRUE
2 2018-07-01 12:00:00 NHDG 2018-07-12 11:00:23 2018-07-12 11:45:00 E FALSE
3 2018-07-12 11:55:23 NSKR 2018-07-01 07:55:11 2018-07-01 12:01:56 W FALSE
4 2018-07-12 11:55:23 NSKR 2018-07-12 11:00:23 2018-07-12 11:45:00 E FALSE
> # Drop the unwanted columns
> tmp <- tmp[,c("DateTime", "REG", "flag")]
> tmp
DateTime REG flag
1 2018-07-01 12:00:00 NHDG TRUE
2 2018-07-01 12:00:00 NHDG FALSE
3 2018-07-12 11:55:23 NSKR FALSE
4 2018-07-12 11:55:23 NSKR FALSE
> # Sum all flags for a given df1 date and limit total to 1
> df1 <- aggregate(flag ~ DateTime + REG, tmp, FUN = function(x) {min(1, sum(x))})
> df1
DateTime REG flag
1 2018-07-01 12:00:00 NHDG 1
2 2018-07-12 11:55:23 NSKR 0
>
運行更多日期,並與原始的for循環和上述sapply()方法進行比較:
Original for loop method: 6.282 sec elapsed
sapply() method: 1.65 sec elapsed
crossing() and aggregate(): 0.385 sec elapsed
完整的腳本在這里:
#!/usr/bin/env Rscript
library(tictoc)
library(tidyr)
# Setup: generate a lot of dates for performance comparison
beg <- as.POSIXct("2018-07-01 12:00:00")
end <- as.POSIXct("2100-12-01 12:00:00")
dates <- seq(beg, end, 60*60*24)
#df1 <- data.frame(c("2018-07-01 12:00:00", "2018-07-12 11:55:23"), c("NHDG","NSKR"))
df1 <- data.frame(dates, rep(c("NHDG","NSKR"), length(dates)/2))
df2 <- data.frame(c("2018-07-01 07:55:11", "2018-07-12 11:00:23"), c("2018-07-01 12:01:56", "2018-07-12 11:45:00"), c("W","E"))
colnames(df1) <- c("DateTime", "REG")
colnames(df2) <- c("StartDateTime","EndDateTime","Direction")
df1$DateTime <- as.POSIXct(df1$DateTime, tz = "America/Los_Angeles")
df2$StartDateTime <- as.POSIXct(df2$StartDateTime, tz = "America/Los_Angeles")
df2$EndDateTime <- as.POSIXct(df2$EndDateTime, tz = "America/Los_Angeles")
# Original (fixed)
tic(sprintf("%30s", "Original for loop method"))
for (i in 1:nrow(df1)){
df1$flag[i] <- 0
for (j in 1:nrow(df2)){
if ((df1$DateTime[i] >= df2$StartDateTime[j]) & (df1$DateTime[i] <= df2$EndDateTime[j])){
df1$flag[i]<-1
break
}
}
}
toc()
result1 <- df1
df1$flag <- NULL
# Sapply
tic(sprintf("%30s", "sapply() method"))
df1$flag = sapply(df1$DateTime,
function(x) as.integer(sum(x >= df2$StartDateTime &
x <= df2$EndDateTime) > 0))
toc()
result2 <- df1
df1$flag <- NULL
# Aggregate
tic(sprintf("%30s", "crossing() and aggregate()"))
# Create a DF with rows for each combination of df1 rows with df2 rows
tmp <- crossing(df1, df2)
# Create a new column for the flag
tmp$flag <- tmp$DateTime >= tmp$StartDateTime & tmp$DateTime <= tmp$EndDateTime
# Drop the unwanted columns
tmp <- tmp[,c("DateTime", "REG", "flag")]
# Sum all flags for a given df1 date and limit total to 1
df1 <- aggregate(flag ~ DateTime + REG, tmp, FUN = function(x) {min(1, sum(x))})
# Sort the rows by date
df1 <- df1[order(df1$DateTime),]
# Reset the row names (for comparison below)
rownames(df1) <- NULL
toc()
result3 <- df1
# Prove that results are the same
if (!all.equal(result1, result2)) {
print("MISMATCH")
stop()
}
if (!all.equal(result1, result3)) {
print(MISMATCH)
stop()
}
print("PASS")
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.