R data.table: set a new column with a logical value if a weekday falls within a date range

Question

I have a data.table object with two date columns, from and to. I want to create a new column that indicates whether a specific weekday falls within the date range.

[Data]

library(data.table)
set.seed(1)
# DT must exist before nrow(DT) can be used in by
DT <- data.table(from = seq.Date(Sys.Date(), Sys.Date() + 100, by = "day"))
DT[, to := from + sample(10, 1), by = 1:nrow(DT)][, from_wd := wday(from)][, to_wd := wday(to)]

> head(DT)
         from         to from_wd to_wd
1: 2015-08-06 2015-08-10       5     2
2: 2015-08-07 2015-08-10       6     2
3: 2015-08-08 2015-08-18       7     3
4: 2015-08-09 2015-08-16       1     1
5: 2015-08-10 2015-08-13       2     5
6: 2015-08-11 2015-08-13       3     5

[My Approach]

In this case, I want to add a new boolean column flag, which is TRUE if a Wednesday falls in the range [from, to].

This is my attempt:

DT[, flag:=0][DT[, .I[4 %in% unique(wday(seq.Date(from, to, by="day")))], by=1:nrow(DT)][[1]], flag:=1]

> table(DT$flag)

 0  1 
21 80

[Question]

The code took some time to run, and as you can imagine, it will take longer as nrow(DT) grows.

My question is: Is there a better way to do this? Better in terms of both speed and code readability (I believe my code is not intuitive at all).

Answer 1

Here's one approach:

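# Return the first date on or after d whose wday() equals wd
next_wday <- function(d,wd=4L){
    wddiff = wd - wday(d)            # negative if that weekday has already passed this week
    d + wddiff + (wddiff < 0L)*7L    # if it has passed, move on to next week
} 

# flag2 is 1 when that date is no later than to
DT[, flag2 := +(next_wday(from) <= to)]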

# test:
DT[,table(flag,flag2)]
#     flag2
# flag  0  1
#    0 44  0
#    1  0 57

The idea is that you compare to against the next Wednesday** on or after from. The replacement line could be written a number of different ways.
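To make the comparison concrete, here is a minimal sketch that applies next_wday to the six rows printed by head(DT) in the question. The table name ex is made up for the example, and the flag_mod column is an illustrative rewrite of the same test using modular arithmetic, not something taken from the original answer.

```r
library(data.table)

# Same helper as above: first date on or after d whose wday() equals wd
next_wday <- function(d, wd = 4L) {
    wddiff <- wd - wday(d)
    d + wddiff + (wddiff < 0L) * 7L
}

# The six rows shown by head(DT) in the question (ex is a hypothetical name)
ex <- data.table(
    from = as.IDate(c("2015-08-06", "2015-08-07", "2015-08-08",
                      "2015-08-09", "2015-08-10", "2015-08-11")),
    to   = as.IDate(c("2015-08-10", "2015-08-10", "2015-08-18",
                      "2015-08-16", "2015-08-13", "2015-08-13"))
)

# Every from date here maps to 2015-08-12, the first wday-4 date on or after it
ex[, next_wed := next_wday(from)]
ex[, flag2 := +(next_wed <= to)]

# Illustrative equivalent: days until the next wday 4, compared with the range length
ex[, flag_mod := +((4L - wday(from)) %% 7L <= as.integer(to - from))]
print(ex)
```

The first two rows end on 2015-08-10, before 2015-08-12, so they get 0; the other four rows get 1, and both columns agree.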

Benchmark

The OP mentioned that from and to could be up to 200 days apart, so...

set.seed(1)
from <- seq(as.IDate("1950-01-01"), by = "day", length = 1e6)
to   <- from + pmin(200,rpois(length(from),1))
DT   <- data.table(from,to)

system.time(DT[, flag2 := +(next_wday(from) <= to)])
#    user  system elapsed 
#    2.11    0.03    2.14

# David Arenburg's solution
system.time({
    DateDT <- DT[, {
                temp <- seq(min(from), max(to), by = "day")
                temp2 <- temp[wday(temp) == 4L]
                list(from = temp2, to = temp2)
               }
             ]
    indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
    DT[, flag := 0L][indx, flag := 1L]
})  
#    user  system elapsed 
#    6.75    0.14    6.89

# check agreement
DT[,table(flag,flag2)]
#     flag2
# flag      0      1
#    0 714666      0
#    1      0 285334

I'm using IDate because it is the date format that comes with the data.table package and is (?) faster to work with. There are a couple of ways one could make the code even faster:

First, it might be faster to restrict attention to rows where to-from is less than 6. Any gap of 6 or greater covers seven consecutive days and therefore contains every weekday, so those rows can simply default to 1, like
```
 DT[, flag2 := 1L][to - from < 6L, flag2 := +(next_wday(from) <= to)] 
```
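As a sanity check on that reasoning, the sketch below compares the full computation with the shortcut on a small grid of made-up dates. The table chk, its start date, and the column names are assumptions for illustration only. Rows with a gap of 6 or more must end up with flag 1, since next_wday(from) is never more than 6 days after from.

```r
library(data.table)

next_wday <- function(d, wd = 4L) {
    wddiff <- wd - wday(d)
    d + wddiff + (wddiff < 0L) * 7L
}

# Hypothetical grid: 14 consecutive start dates crossed with gaps of 0 to 9 days
chk <- CJ(start = 0:13, gap = 0:9)
chk[, from := as.IDate("2015-08-02") + start]
chk[, to := from + gap]

# Full computation on every row
chk[, full := +(next_wday(from) <= to)]

# Shortcut: default to 1, recompute only the ranges shorter than a week
chk[, short := 1L][to - from < 6L, short := +(next_wday(from) <= to)]

# TRUE when the two columns agree on every row
print(chk[, all(full == short)])
```

Whether the shortcut actually saves time depends on how many rows have long gaps, so it is worth timing on real data.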
Second, because the computation depends on only one row at a time, parallelization may lead to some improvement, as illustrated in @grubjesic's answer.
Depending on one's real data, additional improvements might be found.

The OP's code isn't benchmarked here because it entails splitting the data by rows and enumerating up to 200 dates per row, which will certainly be slow.

** wday being 4 means Wednesday in data.table, where Sunday is 1.

Answer 2

You could also try the foverlaps approach.

First, create a data set of all the Wednesdays, starting from min(from) and ending at max(to):

DateDT <- DT[, {
                temp <- seq(min(from), max(to), by = "day")
                temp2 <- temp[wday(temp) == 4L]
                .(from = temp2, to = temp2)
               }
             ]

Then run foverlaps and extract the desired rows:

indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid

Then a simple update by reference will do:

DT[, flag := 0L][indx, flag := 1L]
DT[, table(flag)]
#  0  1 
# 44 57
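For a small illustration of what these three steps produce, the sketch below runs them on the six rows printed by head(DT) in the question. The names ex and WedDT are made up for the example, and the unique() call is an addition that is not in the answer above; it only matters when a range contains more than one Wednesday.

```r
library(data.table)

# The six rows shown by head(DT) in the question (ex is a hypothetical name)
ex <- data.table(
    from = as.IDate(c("2015-08-06", "2015-08-07", "2015-08-08",
                      "2015-08-09", "2015-08-10", "2015-08-11")),
    to   = as.IDate(c("2015-08-10", "2015-08-10", "2015-08-18",
                      "2015-08-16", "2015-08-13", "2015-08-13"))
)

# Every Wednesday (wday 4) between the earliest from and the latest to,
# stored as zero-length intervals so foverlaps can match against them
WedDT <- ex[, {
    days <- seq(min(from), max(to), by = "day")
    weds <- as.IDate(days[wday(days) == 4L])
    list(from = weds, to = weds)
}]
setkey(WedDT, from, to)

# Row numbers of ex whose [from, to] overlaps at least one Wednesday
indx <- unique(foverlaps(ex, WedDT, nomatch = 0L, which = TRUE)$xid)

ex[, flag := 0L][indx, flag := 1L]
print(ex)
```

Only 2015-08-12 falls inside the overall span, so WedDT has a single row, and rows 3 to 6 of ex are the ones that overlap it.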

Answer 3

Here's my example:

library(parallel)

process <- function(){

  from <- seq(as.Date("1950-01-01"), by = "day", length = 100000)
  to <- seq(as.Date("1950-01-04"), by = "day", length = 100000)

  DT <- data.frame(from,to)

  # mclapply relies on forking, so mc.cores must be 1 on Windows
  Ncores <- detectCores()

  flagList <- mclapply(1:nrow(DT),function(id){

    # "%w" numbers weekdays 0-6 with Sunday as 0, so Wednesday is "3"
    "3" %in% strftime(seq(as.Date(DT[id,1]), as.Date(DT[id,2]), by="day"), format="%w")

  },mc.cores=Ncores)

  flag <- unlist(flagList)

  return(cbind(DT,flag))

}

It takes just 15 sec for 100k rows on my i7 processor. Hope this helps.

Answer 1: score 3, accepted, 2015-08-05 21:15:37
Answer 2: score 2, 2015-08-05 21:20:30
Answer 3: score 0, 2015-08-05 21:15:23
