Creating a new date variable that is on the same day of the week, within the same month and year, as the original date variable in R

I need to create a new variable "controldates" from a date variable "casedates". This new variable will consist of dates that fall on the same day of the week as the case date, within the same month and year as the case date. For example, if I have a case date on the 3rd Wednesday of July, my control days will be the 1st, 2nd, and 4th Wednesdays of July (and the 5th, if the month has one). Additionally, I would like to create an indicator variable for each group of dates that is created. I would like to do this using dplyr in R.

Starting data:

Casedate
 "01-03-2015"
 "08-27-2017"
 "10-23-2019"

This is how I would like it to look:

Casedate          Controldate      Index
"01-03-2015"      "01-03-2015"       1
"01-03-2015"      "01-10-2015"       1
"01-03-2015"      "01-17-2015"       1
"01-03-2015"      "01-24-2015"       1
"01-03-2015"      "01-31-2015"       1
"08-27-2017"      "08-06-2017"       2
"08-27-2017"      "08-13-2017"       2
"08-27-2017"      "08-20-2017"       2
"08-27-2017"      "08-27-2017"       2
"10-23-2019"      "10-02-2019"       3
"10-23-2019"      "10-09-2019"       3
"10-23-2019"      "10-16-2019"       3
"10-23-2019"      "10-23-2019"       3
"10-23-2019"      "10-30-2019"       3

Here is an option with tidyverse. Convert 'Casedate' to Date class with lubridate, then loop over the elements with map2, create a sequence of dates in a list column, and unnest that list column.

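library(dplyr)
library(purrr)
library(tidyr)      # provides unnest()
library(lubridate)

df1 %>%
  mutate(Index = row_number(),
         Casedate = mdy(Casedate),
         wd = wday(Casedate, label = TRUE),
         Controldate = map2(floor_date(Casedate, 'month'), wd, ~ {
           # the first seven days of the month contain each weekday exactly once
           x1 <- seq(.x, length.out = 7, by = '1 day')
           # step weekly from the first matching weekday to the last day of the
           # month; the "- 1" keeps the 1st of the next month out of the result
           seq(x1[wday(x1, label = TRUE) == .y],
               ceiling_date(.x, 'month') - 1, by = '7 day')
         })) %>%
  unnest(c(Controldate)) %>%
  select(Casedate, Controldate, Index)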

Output:

# A tibble: 14 x 3
#   Casedate   Controldate Index
#   <date>     <date>      <int>
# 1 2015-01-03 2015-01-03      1
# 2 2015-01-03 2015-01-10      1
# 3 2015-01-03 2015-01-17      1
# 4 2015-01-03 2015-01-24      1
# 5 2015-01-03 2015-01-31      1
# 6 2017-08-27 2017-08-06      2
# 7 2017-08-27 2017-08-13      2
# 8 2017-08-27 2017-08-20      2
# 9 2017-08-27 2017-08-27      2
#10 2019-10-23 2019-10-02      3
#11 2019-10-23 2019-10-09      3
#12 2019-10-23 2019-10-16      3
#13 2019-10-23 2019-10-23      3
#14 2019-10-23 2019-10-30      3

Data:

df1 <- structure(list(Casedate = c("01-03-2015", "08-27-2017", "10-23-2019"
)), class = "data.frame", row.names = c(NA, -3L))
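
The per-date step can also be checked in isolation. The following is a minimal sketch rather than part of the answer above: `controls_for` is a hypothetical helper name, and it assumes lubridate is available. It bounds the weekly sequence at the last day of the case month, since the first day of the following month can fall on the same weekday as the case date.

```r
library(lubridate)

# Hypothetical helper: all dates in d's month that share d's weekday.
controls_for <- function(d) {
  first <- floor_date(d, "month")        # first day of the case month
  last  <- first + months(1) - days(1)   # last day of the case month
  # days from the 1st to the first occurrence of d's weekday (0..6)
  shift <- (wday(d) - wday(first)) %% 7
  seq(first + shift, last, by = "7 days")
}

# 2015-01-04 is a Sunday, and so is 2015-02-01; the latter must not appear.
controls_for(as.Date("2015-01-04"))
```

Because the helper handles a single date, it would still need to be applied per row (for example with map), so it shares the looping cost of the approach above.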

Since there can be at most 4 weeks before or 4 weeks after a date within the same month (9 candidate values in total, including the date itself), you can calculate that whole range in one go with a few sequences. That avoids the need to loop over every value explicitly.

After calculating the candidates, subset to those in the same month as the original value in a single sweep. Using @akrun's df1 example data:

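d  <- as.Date(df1$Casedate, format="%m-%d-%Y")  # parse month-day-year strings
r  <- rep(d, each=9)                             # each case date repeated 9 times
o  <- r + (7 * -4:4)                             # offsets of -4..4 weeks, recycled per case date
i  <- rep(seq_along(d), each=9)                  # group index for each case date
s  <- format(o, "%m") == format(r, "%m")         # TRUE where the candidate is in the case month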

data.frame(
    Casedate = r,
    Controldate = o,
    Index = i
)[s,]

#     Casedate Controldate Index
#5  2015-01-03  2015-01-03     1
#6  2015-01-03  2015-01-10     1
#7  2015-01-03  2015-01-17     1
#8  2015-01-03  2015-01-24     1
#9  2015-01-03  2015-01-31     1
#11 2017-08-27  2017-08-06     2
#12 2017-08-27  2017-08-13     2
#13 2017-08-27  2017-08-20     2
#14 2017-08-27  2017-08-27     2
#20 2019-10-23  2019-10-02     3
#21 2019-10-23  2019-10-09     3
#22 2019-10-23  2019-10-16     3
#23 2019-10-23  2019-10-23     3
#24 2019-10-23  2019-10-30     3
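
To see why nine offsets are enough, it may help to expand a single date by hand. This is only an illustrative sketch; the names `d1`, `cand`, and `keep` are made up for the example and do not appear in the answer.

```r
d1   <- as.Date("2019-10-23")
cand <- d1 + 7 * (-4:4)                          # nine candidates, one week apart
keep <- format(cand, "%m") == format(d1, "%m")   # TRUE where the month matches
data.frame(cand, keep)
```

The candidates run from 28 days before to 28 days after the case date, so they can never reach the same month of a different year, and comparing the month alone is sufficient.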

If you want to keep all of the original variables in the dataset, it is a simple fix:

cbind(
  df1[i,],
  data.frame(Controldate = o, Index = i)
)[s,]

E.g., with two extra columns (othvar1, othvar2) in df1:

#      Casedate othvar1 othvar2 Controldate Index
#1.4 01-03-2015       a       B  2015-01-03     1
#1.5 01-03-2015       a       B  2015-01-10     1
#1.6 01-03-2015       a       B  2015-01-17     1
#1.7 01-03-2015       a       B  2015-01-24     1
#...
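
Since the question asks for dplyr, the same offset idea could plausibly be written as a pipeline. This is a sketch under the assumption that only dplyr is loaded; it is not taken from either answer, and `res` is just an illustrative name.

```r
library(dplyr)

# Same sample data as in the answers above.
df1 <- data.frame(Casedate = c("01-03-2015", "08-27-2017", "10-23-2019"))

res <- df1 %>%
  mutate(Index = row_number(),
         Casedate = as.Date(Casedate, format = "%m-%d-%Y")) %>%
  slice(rep(seq_len(n()), each = 9)) %>%                        # 9 copies of each row
  mutate(Controldate = Casedate + 7 * rep_len(-4:4, n())) %>%   # -4..4 week offsets
  filter(format(Controldate, "%m") == format(Casedate, "%m")) %>%
  select(Casedate, Controldate, Index)

res
```

Dropping the final select() would keep any other columns of df1 alongside the new ones.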

Even on a moderately large dataset (300K rows), there is a meaningful difference in timing between generating the sequences in one go (about 2 seconds) and looping over each value (about 2 minutes):

Sequence:

df1 <- df1[rep(1:3,each=1e5),,drop=FALSE]

system.time({
d  <- as.Date(df1$Casedate, format="%m-%d-%Y")
r  <- rep(d, each=9)
o  <- r + (7 * -4:4)
i  <- rep(seq_along(d), each=9)
s  <- format(o, "%m") == format(r, "%m")

data.frame(
    Casedate = r,
    Controldate = o,
    Index = i
)[s,]
})

#   user  system elapsed 
#  1.909   0.128   2.038 

Looping:

library(dplyr)
library(purrr)
library(tidyr)      # provides unnest()
library(lubridate)

system.time({
df1 %>% 
   mutate(Index = row_number(), 
      Casedate = mdy(Casedate), 
     wd = wday(Casedate, label = TRUE), 
     Controldate = map2(floor_date(Casedate, 'month'), wd, ~ {
   x1 <- seq(.x, length.out = 7, by = '1 day')
    seq(x1[wday(x1, label = TRUE) == .y],
       ceiling_date(.x, 'month'), by = '7 day')})) %>% 
    unnest(Controldate) %>%
    select(Casedate, Controldate, Index)
})

#    user  system elapsed 
# 131.466   1.143 132.623
