
R data.table set new column with logical value if a weekday is between a date range

I have a data.table object with two date columns, from and to. I want to create a new column indicating whether a specific weekday falls within the date range.

[Data]

library(data.table)
set.seed(1)
# assign DT first so that nrow(DT) is available when the by expression is evaluated
DT <- data.table(from=seq.Date(Sys.Date(), Sys.Date()+100, by="day"))
DT[, to:=from+sample(10, 1), by=1:nrow(DT)][, from_wd:=wday(from)][, to_wd:=wday(to)]

> head(DT)
         from         to from_wd to_wd
1: 2015-08-06 2015-08-10       5     2
2: 2015-08-07 2015-08-10       6     2
3: 2015-08-08 2015-08-18       7     3
4: 2015-08-09 2015-08-16       1     1
5: 2015-08-10 2015-08-13       2     5
6: 2015-08-11 2015-08-13       3     5

[My Approach]

In this case, I want to add a new boolean column flag, which should be TRUE if a Wednesday falls within the range [from, to].

This is my attempt:

# flag rows whose [from, to] range contains a Wednesday (wday == 4)
DT[, flag:=0][DT[, .I[4 %in% unique(wday(seq.Date(from, to, by="day")))], by=1:nrow(DT)][[1]], flag:=1]

> table(DT$flag)

 0  1 
21 80 

[Question]

The code took some time to run, and as you can imagine, it will take more time if nrow(DT) gets larger.

My question is: Is there a better way to do this? Better in terms of speed and code readability (I believe my code is not intuitive at all).

Here's one approach:

next_wday <- function(d, wd = 4L){
    # days from d to the target weekday (data.table wday(): 1 = Sunday ... 7 = Saturday)
    wddiff = wd - wday(d)
    # if the target weekday falls earlier in the week than d, jump ahead 7 days
    d + wddiff + (wddiff < 0L)*7L
}


DT[, flag2 := +(next_wday(from) <= to)]  # unary + coerces the logical to integer (0/1)

# test:
DT[,table(flag,flag2)]
#     flag2
# flag  0  1
#    0 44  0
#    1  0 57

The idea is that you compare to against the next Wednesday**. The replacement line could be written in a number of different ways.
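
For instance, here are a few equivalent ways to write it (a sketch assuming the next_wday() helper above; fifelse() is only available in newer data.table versions):

DT[, flag2 := as.integer(next_wday(from) <= to)]       # explicit coercion instead of unary +
DT[, flag2 := (next_wday(from) <= to) * 1L]             # multiply the logical by 1L
DT[, flag2 := fifelse(next_wday(from) <= to, 1L, 0L)]   # data.table's fifelse()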

Benchmark

The OP mentioned that from and to could be up to 200 days apart so...

set.seed(1)
from <- seq(as.IDate("1950-01-01"), by = "day", length = 1e6)
to   <- from + pmin(200,rpois(length(from),1))
DT   <- data.table(from,to)

system.time(DT[, flag2 := +(next_wday(from) <= to)])
#    user  system elapsed 
#    2.11    0.03    2.14

# David Arenburg's solution
system.time({
    DateDT <- DT[, {
                temp <- seq(min(from), max(to), by = "day")
                temp2 <- temp[wday(temp) == 4L]
                list(from = temp2, to = temp2)
               }
             ]
    indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid
    DT[, flag := 0L][indx, flag := 1L]
})  
#    user  system elapsed 
#    6.75    0.14    6.89

# check agreement
DT[,table(flag,flag2)]
#     flag2
# flag      0      1
#    0 714666      0
#    1      0 285334

I'm using IDate because it is the date format that comes with the data.table package and is (?) faster to work with (see the short IDate sketch after this list). There are a couple of ways one could make the code even faster:

  • First, it might be faster to restrict attention to rows where to - from is less than 6 (since any gap of 6 or more days covers every weekday), like

     DT[,flag2:=0L][to-from < 6, flag2 := +(next_wday(from) <= to)] 
  • Second, because the computation only depends on one row at a time, parallelization may lead to some improvement, as illustrated in @grubjesic's answer.

  • Depending on one's real data, additional improvements might be found.
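
As a quick illustration of what as.IDate() gives you (a minimal sketch; the commented output is what current data.table versions return, to the best of my knowledge):

d <- as.IDate("1950-01-01")
class(d)    # "IDate" "Date"  -- IDate inherits from Date
typeof(d)   # "integer"       -- the day count is stored as an integer rather than a double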

The OP's code isn't benchmarked here because it entails splitting the data by rows and enumerating up to 200 dates per row, which will certainly be slow.


** or whatever day wday == 4 corresponds to.
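
For reference, data.table's wday() numbers days 1 = Sunday through 7 = Saturday, which can be checked against a date from the sample data above:

wday(as.IDate("2015-08-05"))  # 2015-08-05 is a Wednesday, so this returns 4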

You could also try the foverlaps approach

First, create a data set of all the Wednesdays starting from min(from) and ending at max(to)

DateDT <- DT[, {
                temp <- seq(min(from), max(to), by = "day")
                temp2 <- temp[wday(temp) == 4L]
                .(from = temp2, to = temp2)
               }
             ]

Then run foverlaps and extract the desired rows

# which = TRUE returns match indices; xid are the DT row numbers that overlap a Wednesday
indx <- foverlaps(DT, setkey(DateDT), nomatch = 0L, which = TRUE)$xid

Then a simple update by reference will do

DT[, flag := 0L][indx, flag := 1L]
DT[, table(flag)]
#  0  1 
# 44 57 

Here's my example:

library(parallel)

process <- function(){

  # 100k example ranges, each spanning four calendar days
  from <- seq(as.Date("1950-01-01"), by = "day", length.out = 100000)
  to   <- seq(as.Date("1950-01-04"), by = "day", length.out = 100000)

  DT <- data.frame(from, to)

  Ncores <- detectCores()

  # one logical per row: does the range contain a Wednesday?
  # strftime(..., "%w") numbers days "0" = Sunday ... "6" = Saturday, so Wednesday is "3"
  # (mclapply() forks the R session; on Windows set mc.cores = 1)
  flagList <- mclapply(1:nrow(DT), function(id){

    "3" %in% strftime(seq(as.Date(DT[id, 1]), as.Date(DT[id, 2]), by = "day"), format = "%w")

  }, mc.cores = Ncores)

  flag <- unlist(flagList)

  return(cbind(DT, flag))

}

It takes just 15 sec for 100k rows on my i7 processor. Hope this helps.
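
A minimal way to call and time this sketch (the result object name is just illustrative; timings will vary by machine and core count):

system.time(res <- process())
head(res)
table(res$flag)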
