R data.table set new column with logical value if a weekday is between a date range
I have a data.table object with two date columns, from and to. I want to create a new column to determine if a specific weekday is in between the date range.
[Data]
library(data.table)
set.seed(1)
# create DT first so that nrow(DT) can be used in the next step
DT <- data.table(from = seq.Date(Sys.Date(), Sys.Date() + 100, by = "day"))
DT[, to := from + sample(10, 1), by = 1:nrow(DT)][, from_wd := wday(from)][, to_wd := wday(to)]
> head(DT)
         from         to from_wd to_wd
1: 2015-08-06 2015-08-10       5     2
2: 2015-08-07 2015-08-10       6     2
3: 2015-08-08 2015-08-18       7     3
4: 2015-08-09 2015-08-16       1     1
5: 2015-08-10 2015-08-13       2     5
6: 2015-08-11 2015-08-13       3     5
[My Approach]
In this case, I want to add a new boolean column flag, which returns TRUE if Wednesday is in the range [from, to].
This is my attempt:
DT[, flag:=0][DT[, .I[4 %in% unique(wday(seq.Date(from, to, by="day")))], by=1:nrow(DT)][[1]], flag:=1]
> table(DT$flag)
0 1
21 80
[Question]
The code took some time to run, and as you can imagine, it will take more time if nrow(DT) gets larger.
My question is: Is there a better way to do this? Better in terms of speed and code readability (I believe my code is not intuitive at all).
Here's one approach:
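# first date on or after d whose wday() equals wd (Sunday = 1, so 4 = Wednesday)
next_wday <- function(d, wd = 4L) {
  wddiff <- wd - wday(d)           # negative if the target weekday has already passed this week
  d + wddiff + (wddiff < 0L) * 7L  # if it has passed, roll over to the following week
}
# flag2 is 1 when that date falls inside [from, to]
DT[, flag2 := +(next_wday(from) <= to)]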
# test:
DT[,table(flag,flag2)]
#     flag2
# flag  0  1
#    0 44  0
#    1  0 57
The idea is that you compare the to column against the next Wednesday** on or after from. The replacement line could be written a number of different ways.
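To see the comparison at work, here is a small worked example. It is only a sketch: the three date ranges are made up for illustration, and next_wday is the function from above, repeated so the snippet runs on its own.

```r
library(data.table)

# next_wday() as defined above: first date on or after d whose wday() equals wd
next_wday <- function(d, wd = 4L) {
  wddiff <- wd - wday(d)
  d + wddiff + (wddiff < 0L) * 7L
}

# made-up ranges for illustration; 2015-08-05 was a Wednesday
ex <- data.table(
  from = as.IDate(c("2015-08-06", "2015-08-03", "2015-08-05")),
  to   = as.IDate(c("2015-08-10", "2015-08-05", "2015-08-05"))
)

ex[, next_wed := next_wday(from)]   # next Wednesday on or after from
ex[, flag := +(next_wed <= to)]     # 1 if that Wednesday is still inside the range
print(ex)
```

The first range starts on a Thursday and ends before the following Wednesday, so its flag is 0. The other two ranges end on a Wednesday, so their flags are 1.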
Benchmark
The OP mentioned that from and to could be up to 200 days apart, so...
set.seed(1)
from <- seq(as.IDate("1950-01-01"), by = "day", length = 1e6)
to <- from + pmin(200,rpois(length(from),1))
DT <- data.table(from,to)
system.time(DT[, flag2 := +(next_wday(from) <= to)])
# user system elapsed
# 2.11 0.03 2.14
# David Arenburg's solution
system.time({
  DateDT <- DT[, {
    temp <- seq(min(from), max(to), by = "day")
    temp2 <- temp[wday(temp) == 4L]
    list(from = temp2, to = temp2)
  }]
  indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
  DT[, flag := 0L][indx, flag := 1L]
})
# user system elapsed
# 6.75 0.14 6.89
# check agreement
DT[,table(flag,flag2)]
#     flag2
# flag      0      1
#    0 714666      0
#    1      0 285334
I'm using IDate because it is the date format that comes with the data.table package and is (?) faster to work with. There are a couple of ways one could make the code even faster:
First, it might be faster to restrict attention to rows where to-from is less than 6 (since any range with a gap of 6 or greater contains every weekday, so its flag is always 1), like
DT[, flag2 := 1L][to - from < 6, flag2 := +(next_wday(from) <= to)]
Second, because the computation only depends on one row at a time, parallelization may lead to some improvement, as illustrated in @grubjesic's answer.
Depending on one's real data, additional improvements might be found.
The OP's code isn't benchmarked here because it entails splitting the data by rows and enumerating up to 200 dates per row, which will certainly be slow.
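As a sanity check on the first suggestion above: a range with to - from of 6 or more covers seven consecutive days, so it always contains a Wednesday and can be flagged 1 without calling next_wday. The sketch below, on made-up data (the table chk and its columns are only for illustration), compares that shortcut with the full computation.

```r
library(data.table)

next_wday <- function(d, wd = 4L) {
  wddiff <- wd - wday(d)
  d + wddiff + (wddiff < 0L) * 7L
}

# made-up data: 1000 consecutive start dates, gaps of 0 to 20 days
set.seed(1)
chk <- data.table(from = seq(as.IDate("1950-01-01"), by = "day", length.out = 1000L))
chk[, to := from + sample(0:20, .N, replace = TRUE)]

# full computation on every row
chk[, full := +(next_wday(from) <= to)]

# shortcut: long ranges are 1 by construction, only short ranges need the check
chk[, short := 1L][as.integer(to - from) < 6L, short := +(next_wday(from) <= to)]

# TRUE if both columns agree on every row
chk[, all(full == short)]
```

Whether the shortcut actually saves time depends on how many rows have short ranges, so it is worth timing on one's own data.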
** data.table's wday() numbers the days from Sunday = 1, so 4 means Wednesday.
You could also try the foverlaps approach.
First, create a data set of all the Wednesdays from min(from) to max(to):
DateDT <- DT[, {
  temp <- seq(min(from), max(to), by = "day")
  temp2 <- temp[wday(temp) == 4L]
  .(from = temp2, to = temp2)
}]
Then run foverlaps and extract the desired rows:
indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
Then a simple update by reference will do:
DT[, flag := 0L][indx, flag := 1L]
DT[, table(flag)]
# 0 1
# 44 57
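The three steps can be traced on a tiny table. This is just a sketch with made-up dates (rng, wedDT and hits are illustrative names), and it adds unique() because a long range can overlap more than one Wednesday and would otherwise appear twice in xid.

```r
library(data.table)

# made-up ranges for illustration; 2015-08-05 and 2015-08-12 were Wednesdays
rng <- data.table(
  from = as.IDate(c("2015-08-06", "2015-08-03", "2015-08-04")),
  to   = as.IDate(c("2015-08-10", "2015-08-05", "2015-08-13"))
)

# step 1: every Wednesday in the overall span, stored as zero-length intervals
days  <- seq(min(rng$from), max(rng$to), by = "day")
weds  <- as.IDate(days[wday(days) == 4L])
wedDT <- data.table(from = weds, to = weds)
setkey(wedDT, from, to)

# step 2: row numbers of rng that overlap at least one Wednesday
hits <- foverlaps(rng, wedDT, nomatch = 0L, which = TRUE)
indx <- unique(hits$xid)   # row 3 overlaps two Wednesdays

# step 3: update by reference
rng[, flag := 0L][indx, flag := 1L]
print(rng)
```

Only the first range (Thursday to Monday) misses a Wednesday, so the flags come out as 0, 1, 1.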
Here's my example:
library(parallel)
process <- function(){
  from <- seq(as.Date("1950-01-01"), by = "day", length.out = 100000)
  to <- seq(as.Date("1950-01-04"), by = "day", length.out = 100000)
  DT <- data.frame(from, to)
  Ncores <- detectCores()
  # one task per row; mclapply forks, so mc.cores > 1 is not supported on Windows
  flagList <- mclapply(1:nrow(DT), function(id){
    # "%w" numbers weekdays from Sunday = "0", so Wednesday is "3"
    "3" %in% strftime(seq(as.Date(DT[id,1]), as.Date(DT[id,2]), by="day"), format="%w")
  }, mc.cores = Ncores)
  flag <- unlist(flagList)
  return(cbind(DT, flag))
}
It takes just 15 sec for 100k rows on my i7 processor. Hope this helps.
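One thing to watch when mixing the approaches in this thread: strftime's "%w" and data.table's wday() number the weekdays differently. A quick check on a date known to be a Wednesday (the date is picked only for illustration):

```r
library(data.table)

d <- as.IDate("2015-08-05")   # a Wednesday

wd_dt <- wday(d)                     # data.table: Sunday = 1, so Wednesday is 4
wd_w  <- format(as.Date(d), "%w")    # strftime:   Sunday = "0", so Wednesday is "3"

print(wd_dt)
print(wd_w)
```

So a test for Wednesday is wday(x) == 4L with data.table, but "3" when matching against "%w" output.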