R - Filtering character dates in data.table
By default, fread in data.table reads dates as character values. With this default, I noticed that filtering in i for a date range takes dramatically different amounts of time depending on whether I use logical comparison or the %in% operator:
library(data.table)
CharDateRange <- function(start.date, end.date) {
  # format() is vectorised over Date vectors, so no sapply() is needed
  format(seq(as.Date(start.date), as.Date(end.date), by = "days"), "%Y-%m-%d")
}
# define a range of dates, represented by a character vector
range.dates <- CharDateRange("2015-01-01", "2015-01-31")
# create example data table
nrows <- 1e7
DT <- data.table(date = sample(range.dates, nrows, replace=T),
value = runif(nrows))
The %in% operation is much faster than the logical comparison:
print(system.time(DT[date %in% CharDateRange("2015-01-10", "2015-01-17")]))
> user system elapsed
0.238 0.017 0.254
and
print(system.time(DT[date >= "2015-01-10" & date <= "2015-01-17"]))
> user system elapsed
6.693 0.018 6.711
Could you please explain why this is so?
This is to be expected and is not related to data.table or dates:
myvec <- rep(c("111111","999999"),1e7)
mycompvec <- as.character(111111:999999)
system.time(myvec%in%mycompvec)
# user system elapsed
# 1.39 0.08 1.49
system.time(myvec<="999999"&myvec>="111111")
# user system elapsed
# 9.92 0.03 10.03
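A likely explanation, which the benchmark above does not spell out: %in% is defined in base R in terms of match(), which looks values up in a hash table, whereas <= and >= on character vectors compare strings using the collating sequence of the current locale, and that is comparatively expensive per element. The small sketch below only checks that the two filters select the same elements for ISO-formatted date strings; the names dates and wanted are made up for illustration.

```r
# ISO-8601 date strings: lexicographic order matches chronological order
dates  <- c("2015-01-05", "2015-01-10", "2015-01-17", "2015-01-20")
wanted <- format(seq(as.Date("2015-01-10"), as.Date("2015-01-17"), by = "day"),
                 "%Y-%m-%d")

# %in% is a thin wrapper around match(), which uses hashing
by_in    <- dates %in% wanted
by_match <- !is.na(match(dates, wanted))

# Two locale-aware string comparisons per element
by_cmp <- dates >= "2015-01-10" & dates <= "2015-01-17"

print(data.frame(dates, by_in, by_match, by_cmp))
```

All three logical vectors should agree, so the choice between them is purely a matter of speed.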
I should also point out that using keys is faster still (about a 17% improvement, not as dramatic as I would have expected):
DT <- data.table(date = sample(range.dates, nrows, replace=T),
value = runif(nrows),key="date")
library(microbenchmark)
microbenchmark(times=10,
DT[date %in% CharDateRange("2015-01-10", "2015-01-17")],
DT[date >= "2015-01-10" & date <= "2015-01-17"],
DT[.(CharDateRange("2015-01-10", "2015-01-17"))])
Unit: milliseconds
expr min lq mean median uq max neval cld
DT[date %in% CharDateRange("2015-01-10", "2015-01-17")] 30.17786 30.90273 33.29402 31.71152 31.99111 42.29018 10 a
DT[date >= "2015-01-10" & date <= "2015-01-17"] 4825.18913 4842.19703 4855.27402 4846.98401 4861.02841 4926.22591 10 b
DT[.(CharDateRange("2015-01-10", "2015-01-17"))] 26.15394 26.77365 30.34439 28.14887 34.97858 35.95498 10 a
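One caveat with the keyed form, shown here on a tiny made-up table (small and wanted are illustrative names, not part of the benchmark): DT[.(values)] is a join, so a requested date that is absent from the table comes back as a row of NA unless nomatch = 0L is passed. In the benchmark data every date in the range is present, so the difference does not show up there.

```r
library(data.table)

# Tiny keyed table; key = "date" sorts the rows by date
small <- data.table(date  = c("2015-01-20", "2015-01-10", "2015-01-05", "2015-01-17"),
                    value = 1:4,
                    key   = "date")

# "2015-01-11" does not occur in the table
wanted <- c("2015-01-10", "2015-01-11", "2015-01-17")

with_na <- small[.(wanted)]                 # keyed join, keeps unmatched dates as NA rows
by_key  <- small[.(wanted), nomatch = 0L]   # keyed join, drops unmatched dates
by_in   <- small[date %in% wanted]          # vector scan with %in%

print(with_na)
print(by_key)
```

With nomatch = 0L the keyed subset and the %in% subset return the same rows.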
The bigger improvement, I found, comes from working with dates directly (especially for the inequality comparisons, though they are still much slower, for the reasons @Frank pointed out):
DT2 <- data.table(date=sample(seq(from=as.Date("2015-01-01"),
to=as.Date("2015-01-31"),by="day"),
nrows,replace=T),value=runif(nrows),key="date")
microbenchmark(times=10,
DT[date %in% CharDateRange("2015-01-10", "2015-01-17")],
DT[date >= "2015-01-10" & date <= "2015-01-17"],
DT[.(CharDateRange("2015-01-10", "2015-01-17"))],
DT2[date %in% seq(from=as.Date("2015-01-10"),to=as.Date("2015-01-17"),by="day")],
DT2[date>="2015-01-10"&date<="2015-01-17"],
DT2[.(seq(from=as.Date("2015-01-10"),to=as.Date("2015-01-17"),by="day"))])
Unit: milliseconds
expr min lq mean median uq max neval
DT[date %in% CharDateRange("2015-01-10", "2015-01-17")] 30.22378 31.17341 32.56766 32.11701 33.53306 37.03804 10
DT[date >= "2015-01-10" & date <= "2015-01-17"] 4856.15109 4877.55814 4952.64332 4910.17639 4952.12055 5337.04256 10
DT[.(CharDateRange("2015-01-10", "2015-01-17"))] 27.32360 27.82355 28.69142 28.74196 29.27730 30.31997 10
DT2[date %in% seq(from = as.Date("2015-01-10"), to = as.Date("2015-01-17"), by = "day")] 23.32938 24.44665 26.11454 25.05308 26.34364 36.58792 10
DT2[date >= "2015-01-10" & date <= "2015-01-17"] 264.96633 272.44326 276.98355 277.07129 279.22478 291.16967 10
DT2[.(seq(from = as.Date("2015-01-10"), to = as.Date("2015-01-17"), by = "day"))] 18.89304 20.83852 20.85754 20.89787 21.05545 21.76082 10
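If the table already holds character dates (for example straight from fread), a minimal sketch of converting the column once and then filtering on Date values could look like this. The table small and the bounds lo and hi are made-up names, and the one-off conversion cost on a large table is not measured here.

```r
library(data.table)

# Character dates, as in the original question
small <- data.table(date  = c("2015-01-05", "2015-01-10", "2015-01-17", "2015-01-20"),
                    value = 1:4)

# Convert the column in place, once
small[, date := as.Date(date, format = "%Y-%m-%d")]

# Compare against Date objects so the bounds are not re-parsed from strings
lo <- as.Date("2015-01-10")
hi <- as.Date("2015-01-17")
in_range <- small[date >= lo & date <= hi]

print(in_range)
```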