R - Filtering character dates in data.table

By default, fread in data.table stores dates as character values. With this default, I noticed that filtering in i for a date range takes dramatically different amounts of time depending on whether I use logical comparison or the %in% operator:

library(data.table)

CharDateRange <- function(start.date, end.date) {
    sapply(seq(as.Date(start.date), as.Date(end.date), by="days"),
           function (x) format(x, "%Y-%m-%d"))
}

# define a range of dates, represented by a character vector
range.dates <- CharDateRange("2015-01-01", "2015-01-31")

# create example data table
nrows <- 1e7
DT <- data.table(date = sample(range.dates, nrows, replace = TRUE),
                 value = runif(nrows))

The %in% operation is much faster than logical comparison:

print(system.time(DT[date %in% CharDateRange("2015-01-10", "2015-01-17")]))
#    user  system elapsed 
#   0.238   0.017   0.254 

and

print(system.time(DT[date >= "2015-01-10" & date <= "2015-01-17"]))
#    user  system elapsed 
#   6.693   0.018   6.711 

Could you please explain why this is so?

This is to be expected and is not related to data.table or dates:

myvec <- rep(c("111111", "999999"), 1e7)
mycompvec <- as.character(111111:999999)

system.time(myvec %in% mycompvec)
#    user  system elapsed 
#    1.39    0.08    1.49 
system.time(myvec <= "999999" & myvec >= "111111")
#    user  system elapsed 
#    9.92    0.03   10.03 
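
A likely explanation for the gap, offered here as background rather than something stated in the answer: `%in%` is a thin wrapper around `match()`, which looks values up through a hash table, while `<=` and `>=` on character vectors perform a locale-aware string collation for every element. The base-R sketch below (variable names are illustrative) checks that both filters select the same elements for ISO-formatted dates, so the cheaper one can be substituted safely.

```r
# Illustrative data: ISO-formatted dates stored as character strings
all_days <- format(seq(as.Date("2015-01-01"), as.Date("2015-01-31"), by = "day"))
set.seed(1)
x <- sample(all_days, 1e5, replace = TRUE)

# The days we want to keep, also as character strings
wanted <- format(seq(as.Date("2015-01-10"), as.Date("2015-01-17"), by = "day"))

# Hash-based membership test; %in% is match(x, table, nomatch = 0L) > 0L
by_in <- x %in% wanted

# Element-by-element string comparison (locale-aware collation)
by_cmp <- x >= "2015-01-10" & x <= "2015-01-17"

# Both approaches flag exactly the same elements
stopifnot(identical(by_in, by_cmp))
```

Because the strings are zero-padded and in year-month-day order, lexical order matches chronological order, which is why the two filters agree.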

It is also worth pointing out that using keys is faster still (about a 17% improvement, not as dramatic as I would have expected):

library(microbenchmark)

DT <- data.table(date = sample(range.dates, nrows, replace = TRUE),
                 value = runif(nrows), key = "date")

microbenchmark(times = 10,
               DT[date %in% CharDateRange("2015-01-10", "2015-01-17")],
               DT[date >= "2015-01-10" & date <= "2015-01-17"],
               DT[.(CharDateRange("2015-01-10", "2015-01-17"))])
Unit: milliseconds
                                                    expr        min         lq       mean     median         uq        max neval cld
 DT[date %in% CharDateRange("2015-01-10", "2015-01-17")]   30.17786   30.90273   33.29402   31.71152   31.99111   42.29018    10  a 
         DT[date >= "2015-01-10" & date <= "2015-01-17"] 4825.18913 4842.19703 4855.27402 4846.98401 4861.02841 4926.22591    10   b
        DT[.(CharDateRange("2015-01-10", "2015-01-17"))]   26.15394   26.77365   30.34439   28.14887   34.97858   35.95498    10  a 
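
For readers unfamiliar with keyed subsetting, here is a minimal sketch of what the `DT[.(...)]` form does: `setkey()` sorts the table by the key column, and `.()` in `i` then joins against that key using a binary search rather than a vector scan. The small table below is illustrative, and `nomatch = NULL` is added here only to drop requested values that are absent from the table.

```r
library(data.table)

# Small illustrative table with character dates
dt <- data.table(date = c("2015-01-20", "2015-01-10", "2015-01-17", "2015-01-10"),
                 value = 1:4)

# Sort by date and mark it as the key
setkey(dt, date)

# Requested dates; "2015-01-11" does not occur in the table
wanted <- c("2015-01-10", "2015-01-11", "2015-01-17")

# Vector-scan style filter
by_scan <- dt[date %in% wanted]

# Keyed join on the same values, dropping non-matches
by_key <- dt[.(wanted), nomatch = NULL]

# Both return the same rows in the same order
stopifnot(identical(by_scan$value, by_key$value))
```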

The bigger improvement, I found, comes from working with dates directly (especially for the inequality comparisons, though they are still much slower, for the reasons @Frank pointed out):

DT2 <- data.table(date = sample(seq(from = as.Date("2015-01-01"),
                                    to = as.Date("2015-01-31"), by = "day"),
                                nrows, replace = TRUE),
                  value = runif(nrows), key = "date")
microbenchmark(times=10,
               DT[date %in% CharDateRange("2015-01-10", "2015-01-17")],
               DT[date >= "2015-01-10" & date <= "2015-01-17"],
               DT[.(CharDateRange("2015-01-10", "2015-01-17"))],
               DT2[date %in% seq(from=as.Date("2015-01-10"),to=as.Date("2015-01-17"),by="day")],
               DT2[date>="2015-01-10"&date<="2015-01-17"],
               DT2[.(seq(from=as.Date("2015-01-10"),to=as.Date("2015-01-17"),by="day"))])
Unit: milliseconds
                                                                                          expr        min         lq       mean     median         uq        max neval
                                       DT[date %in% CharDateRange("2015-01-10", "2015-01-17")]   30.22378   31.17341   32.56766   32.11701   33.53306   37.03804    10
                                               DT[date >= "2015-01-10" & date <= "2015-01-17"] 4856.15109 4877.55814 4952.64332 4910.17639 4952.12055 5337.04256    10
                                              DT[.(CharDateRange("2015-01-10", "2015-01-17"))]   27.32360   27.82355   28.69142   28.74196   29.27730   30.31997    10
 DT2[date %in% seq(from = as.Date("2015-01-10"), to = as.Date("2015-01-17"),      by = "day")]   23.32938   24.44665   26.11454   25.05308   26.34364   36.58792    10
                                              DT2[date >= "2015-01-10" & date <= "2015-01-17"]  264.96633  272.44326  276.98355  277.07129  279.22478  291.16967    10
        DT2[.(seq(from = as.Date("2015-01-10"), to = as.Date("2015-01-17"),      by = "day"))]   18.89304   20.83852   20.85754   20.89787   21.05545   21.76082    10
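
As a sketch of the "work with dates directly" advice, a character column can be converted once and then filtered as a date. The example assumes data.table's `as.IDate()` (an integer-backed date class) and `%between%` (an inclusive range test); the table and column names are illustrative, and no timings are claimed for it.

```r
library(data.table)

# Small illustrative table with dates stored as character strings
dt <- data.table(date = c("2015-01-05", "2015-01-10", "2015-01-17", "2015-01-20"),
                 value = 1:4)

# Convert once, by reference, to data.table's integer-backed date class
dt[, date := as.IDate(date)]

# Inclusive range filter on the converted column
bounds <- as.IDate(c("2015-01-10", "2015-01-17"))
res <- dt[date %between% bounds]
print(res)
```

Converting up front means the cost of parsing the strings is paid once instead of on every filter.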
