Creating a new date variable that is on the same day of the week, within the same month and year, as the original date variable in R
I need to create a new variable "controldates" from a date variable "casedates". The new variable should consist of dates that fall on the same day of the week as the case date, within the same month and year as the case date. For example, if I have a case date on the 3rd Wednesday of July, my control days will be the 1st, 2nd, and 4th Wednesdays of July. Additionally, I would like to create an indicator variable for each group of dates that is created. I would like to do this using dplyr in R.
Starting data:
Casedate
"01-03-2015"
"08-27-2017"
"10-23-2019"
This is how I would like it to look:
Casedate      Controldate   Index
"01-03-2015"  "01-03-2015"  1
"01-03-2015"  "01-10-2015"  1
"01-03-2015"  "01-17-2015"  1
"01-03-2015"  "01-24-2015"  1
"01-03-2015"  "01-31-2015"  1
"08-27-2017"  "08-06-2017"  2
"08-27-2017"  "08-13-2017"  2
"08-27-2017"  "08-20-2017"  2
"08-27-2017"  "08-27-2017"  2
"10-23-2019"  "10-02-2019"  3
"10-23-2019"  "10-09-2019"  3
"10-23-2019"  "10-16-2019"  3
"10-23-2019"  "10-23-2019"  3
"10-23-2019"  "10-30-2019"  3
Here is an option with tidyverse. Convert the 'Casedate' column to Date class with lubridate, then loop over the elements with map2, create a sequence of dates in a list column, and unnest the list column.
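library(dplyr)
library(purrr)
library(tidyr)      # unnest() comes from tidyr
library(lubridate)
df1 %>%
  mutate(Index = row_number(),
         Casedate = mdy(Casedate),
         wd = wday(Casedate, label = TRUE),
         Controldate = map2(floor_date(Casedate, 'month'), wd, ~ {
           # the first 7 days of the month contain each weekday exactly once
           x1 <- seq(.x, length.out = 7, by = '1 day')
           # step weekly from the first matching weekday to the last day of
           # the month; the - 1 keeps the 1st of the next month out
           seq(x1[wday(x1, label = TRUE) == .y],
               ceiling_date(.x, 'month') - 1, by = '7 day')})) %>%
  unnest(c(Controldate)) %>%
  select(Casedate, Controldate, Index)
-output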
# A tibble: 14 x 3
# Casedate Controldate Index
# <date> <date> <int>
# 1 2015-01-03 2015-01-03 1
# 2 2015-01-03 2015-01-10 1
# 3 2015-01-03 2015-01-17 1
# 4 2015-01-03 2015-01-24 1
# 5 2015-01-03 2015-01-31 1
# 6 2017-08-27 2017-08-06 2
# 7 2017-08-27 2017-08-13 2
# 8 2017-08-27 2017-08-20 2
# 9 2017-08-27 2017-08-27 2
#10 2019-10-23 2019-10-02 3
#11 2019-10-23 2019-10-09 3
#12 2019-10-23 2019-10-16 3
#13 2019-10-23 2019-10-23 3
#14 2019-10-23 2019-10-30 3
-data
df1 <- structure(list(Casedate = c("01-03-2015", "08-27-2017", "10-23-2019"
)), class = "data.frame", row.names = c(NA, -3L))
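To see what the map2 step does for a single case date, here is a minimal walk-through using only lubridate. It is a sketch, not part of the answer above: the variable names (case_date, first, last, week1, start, controls_one) are illustrative, and the last day of the month is computed as first + months(1) - days(1), which is just one of several ways to get it.

```r
library(lubridate)

case_date <- mdy("08-27-2017")                      # one case date (a Sunday)
first <- floor_date(case_date, "month")             # first day of that month
last  <- first + months(1) - days(1)                # last day of that month

# The first 7 days of a month contain every weekday exactly once,
# so exactly one of them matches the weekday of the case date.
week1 <- seq(first, by = "1 day", length.out = 7)
start <- week1[wday(week1) == wday(case_date)]

# Step forward one week at a time without leaving the month.
controls_one <- seq(start, last, by = "7 days")
controls_one
```

For 27 August 2017 this should give the four Sundays of that month, matching rows 6 to 9 of the output above.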
Since a date can have at most 4 same-weekday dates before it and at most 4 after it within the same month (9 candidate values in total, counting the date itself), you can calculate that whole range in one go with some sequences. That avoids the need to loop over every value explicitly.
After calculating the candidates, subset to those in the same month as the original value in a single sweep. Using @akrun's df1 example data from the other answer:
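d <- as.Date(df1$Casedate, format="%m-%d-%Y")
r <- rep(d, each=9)                      # each case date repeated 9 times
o <- r + (7 * -4:4)                      # candidates: 4 weeks before to 4 weeks after
i <- rep(seq_along(d), each=9)           # group index, one value per case date
s <- format(o, "%m") == format(r, "%m")  # TRUE where the candidate is in the same month
data.frame(
  Casedate = r,
  Controldate = o,
  Index = i
)[s,]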
# Casedate Controldate Index
#5 2015-01-03 2015-01-03 1
#6 2015-01-03 2015-01-10 1
#7 2015-01-03 2015-01-17 1
#8 2015-01-03 2015-01-24 1
#9 2015-01-03 2015-01-31 1
#11 2017-08-27 2017-08-06 2
#12 2017-08-27 2017-08-13 2
#13 2017-08-27 2017-08-20 2
#14 2017-08-27 2017-08-27 2
#20 2019-10-23 2019-10-02 3
#21 2019-10-23 2019-10-09 3
#22 2019-10-23 2019-10-16 3
#23 2019-10-23 2019-10-23 3
#24 2019-10-23 2019-10-30 3
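The offsets idea is easiest to follow on a single date. The following is a small worked example rather than part of the answer; one_date, candidates and keep are illustrative names.

```r
# One case date: 27 August 2017 (a Sunday).
one_date <- as.Date("08-27-2017", format = "%m-%d-%Y")

# Nine candidates: 4 weeks before through 4 weeks after, in 7-day steps.
candidates <- one_date + 7 * (-4:4)

# Keep only the candidates that fall in the same month as the case date.
keep <- format(candidates, "%m") == format(one_date, "%m")
candidates[keep]
```

The candidates run from 30 July to 24 September 2017, and only the four August Sundays survive the month filter, which corresponds to rows 11 to 14 in the output above.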
If you want to keep all of the original variables in the dataset, it is a simple fix:
cbind(
  df1[i, , drop=FALSE],   # drop=FALSE keeps a data.frame even if df1 has one column
  data.frame(Controldate = o, Index = i)
)[s,]
E.g.:
# Casedate othvar1 othvar2 Controldate Index
#1.4 01-03-2015 a B 2015-01-03 1
#1.5 01-03-2015 a B 2015-01-10 1
#1.6 01-03-2015 a B 2015-01-17 1
#1.7 01-03-2015 a B 2015-01-24 1
#...
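Since the question asks for dplyr, the same offsets idea can also be written as a pipeline. This is a minimal sketch and not part of the original answer: cases, controls and offset are illustrative names, and it assumes dplyr and tidyr are installed.

```r
library(dplyr)
library(tidyr)

# Same example dates as df1, under a separate (hypothetical) name.
cases <- data.frame(Casedate = c("01-03-2015", "08-27-2017", "10-23-2019"))

controls <- cases %>%
  mutate(Index = row_number(),
         Casedate = as.Date(Casedate, format = "%m-%d-%Y"),
         offset = list(7L * -4:4)) %>%   # the same 9 weekly offsets for every row
  unnest(offset) %>%                     # one row per case date and offset
  mutate(Controldate = Casedate + offset) %>%
  filter(format(Controldate, "%m") == format(Casedate, "%m")) %>%
  select(Casedate, Controldate, Index)

controls
```

Any other columns in the input are carried along automatically until the final select(), so dropping that last step keeps all of the original variables.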
Even on a moderately large dataset (300K rows), there is a meaningful difference in timing between generating the sequences in one go (about 2 seconds) and looping over each value (about 2 minutes):
Sequence:
df1 <- df1[rep(1:3,each=1e5),,drop=FALSE]
system.time({
  d <- as.Date(df1$Casedate, format="%m-%d-%Y")
  r <- rep(d, each=9)
  o <- r + (7 * -4:4)
  i <- rep(seq_along(d), each=9)
  s <- format(o, "%m") == format(r, "%m")
  data.frame(
    Casedate = r,
    Controldate = o,
    Index = i
  )[s,]
})
#   user  system elapsed
#  1.909   0.128   2.038
Looping:
library(dplyr)
library(purrr)
library(tidyr)      # unnest() comes from tidyr
library(lubridate)
system.time({
  df1 %>%
    mutate(Index = row_number(),
           Casedate = mdy(Casedate),
           wd = wday(Casedate, label = TRUE),
           Controldate = map2(floor_date(Casedate, 'month'), wd, ~ {
             x1 <- seq(.x, length.out = 7, by = '1 day')
             # - 1 so the sequence ends on the last day of the month
             seq(x1[wday(x1, label = TRUE) == .y],
                 ceiling_date(.x, 'month') - 1, by = '7 day')})) %>%
    unnest(Controldate) %>%
    select(Casedate, Controldate, Index)
})
#    user  system elapsed
# 131.466   1.143 132.623
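Before comparing timings, it is worth confirming that the vectorised result agrees with a straightforward per-date computation. The sketch below is a base R check, not the benchmark itself: dates, vectorised and reference are illustrative names, and the reference simply lists every day of the month and keeps the matching weekdays. The fourth date is an added edge case, since 1 February 2015 and 1 March 2015 are both Sundays.

```r
dates <- as.Date(c("2015-01-03", "2017-08-27", "2019-10-23", "2015-02-01"))

# Vectorised approach: 9 weekly offsets per date, filtered to the same month.
r <- rep(dates, each = 9)
o <- r + 7 * (-4:4)
vectorised <- o[format(o, "%m") == format(r, "%m")]

# Per-date reference: every day of the month that shares the weekday.
reference <- unlist(lapply(dates, function(x) {
  first <- as.Date(format(x, "%Y-%m-01"))                   # first of the month
  last  <- seq(first, by = "month", length.out = 2)[2] - 1  # last of the month
  days  <- seq(first, last, by = "day")
  format(days[format(days, "%w") == format(x, "%w")])
}))

# Compare as character strings to sidestep Date storage details.
identical(format(vectorised), reference)
```

If the two approaches agree, the final line should return TRUE; the February case yields only the four February Sundays and does not spill into March.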