How do I find if a date in the first DF falls within the range of dates in another data frame?
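This one is a bit tricky. I have two data frames and need help creating a function or some kind of loop that tells me whether a date in data.frame y falls between the values of two columns in data.frame x.
So, for example:
Data frame x: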
x <- structure(list(ID = c(1L, 1L, 3L, 2L, 2L), GroupID = c(45L, 65L, 45L, 65L, 45L), DateStart = c("2/11/2021",
"2/14/2021", "2/10/2021", "2/16/2021", "2/19/2021"), DateEnd = c("2/13/2021",
"2/15/2021", "2/14/2021", "2/18/2021", "2/22/2021")),
class = "data.frame", row.names = c(NA, -5L))
x
ID GroupID DateStart DateEnd
1 1 45 2/11/2021 2/13/2021
2 1 65 2/14/2021 2/15/2021
3 3 45 2/10/2021 2/14/2021
4 2 65 2/16/2021 2/18/2021
5 2 45 2/19/2021 2/22/2021
Then there is data frame y:
y <- structure(list(ID = c(1L, 1L, 1L, 1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,
3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L), GroupID = c(45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,
45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,
45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L),
DateStart = c("2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021",
"2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021",
"2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021")),
class = "data.frame", row.names = c(NA, -42L))
y
ID GroupID DateStart
1 1 45 2/11/2021
2 1 65 2/11/2021
3 1 45 2/12/2021
4 1 65 2/12/2021
5 1 45 2/13/2021
6 1 65 2/13/2021
7 1 45 2/14/2021
8 1 65 2/14/2021
9 1 45 2/15/2021
10 1 65 2/15/2021
11 1 45 2/16/2021
12 1 65 2/16/2021
13 1 45 2/17/2021
14 1 65 2/17/2021
15 2 45 2/11/2021
16 2 65 2/11/2021
17 2 45 2/12/2021
18 2 65 2/12/2021
19 2 45 2/13/2021
20 2 65 2/13/2021
21 2 45 2/14/2021
22 2 65 2/14/2021
23 2 45 2/15/2021
24 2 65 2/15/2021
25 2 45 2/16/2021
26 2 65 2/16/2021
27 2 45 2/17/2021
28 2 65 2/17/2021
29 3 45 2/11/2021
30 3 65 2/11/2021
31 3 45 2/12/2021
32 3 65 2/12/2021
33 3 45 2/13/2021
34 3 65 2/13/2021
35 3 45 2/14/2021
36 3 65 2/14/2021
37 3 45 2/15/2021
38 3 65 2/15/2021
39 3 45 2/16/2021
40 3 65 2/16/2021
41 3 45 2/17/2021
42 3 65 2/17/2021
What I'd like to end up with:
y
ID GroupID DateStart Dummy
1 1 45 2/11/2021 1
2 1 65 2/11/2021 NA
3 1 45 2/12/2021 1
4 1 65 2/12/2021 NA
5 1 45 2/13/2021 1
6 1 65 2/13/2021 NA
7 1 45 2/14/2021 NA
8 1 65 2/14/2021 1
9 1 45 2/15/2021 NA
10 1 65 2/15/2021 1
11 1 45 2/16/2021 NA
12 1 65 2/16/2021 NA
13 1 45 2/17/2021 NA
14 1 65 2/17/2021 NA
15 2 45 2/11/2021 NA
16 2 65 2/11/2021 NA
17 2 45 2/12/2021 NA
18 2 65 2/12/2021 NA
19 2 45 2/13/2021 NA
20 2 65 2/13/2021 NA
21 2 45 2/14/2021 NA
22 2 65 2/14/2021 NA
23 2 45 2/15/2021 NA
24 2 65 2/15/2021 NA
25 2 45 2/16/2021 NA
26 2 65 2/16/2021 1
27 2 45 2/17/2021 NA
28 2 65 2/17/2021 1
29 3 45 2/11/2021 1
30 3 65 2/11/2021 NA
31 3 45 2/12/2021 1
32 3 65 2/12/2021 NA
33 3 45 2/13/2021 1
34 3 65 2/13/2021 NA
35 3 45 2/14/2021 1
36 3 65 2/14/2021 NA
37 3 45 2/15/2021 NA
38 3 65 2/15/2021 NA
39 3 45 2/16/2021 NA
40 3 65 2/16/2021 NA
41 3 45 2/17/2021 NA
42 3 65 2/17/2021 NA
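The resulting data.frame y gets a new fourth column that holds a 1 when the date falls between DateStart and DateEnd in data.frame x, matched by GroupID and ID. For example, for every row in y with ID=3 and GroupID=45 whose date lies between 2/10/2021 and 2/14/2021 (inclusive), I want the loop or function to put a 1 in the dummy column. For dates in y that don't meet the conditions in x, I want an NA.
Basically, for each row in y I need to read through every row in x and get a 1 if, for the matching GroupID and ID, the date falls within the specified range.
My real data set is very large (around 2 million observations), so I'm also looking for a fast way to do this.
I tried this: R: Check if value from dataframe is within range other dataframe
but no dice.
Thanks in advance!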
We could use left_join, the mdy function to convert the date columns to class Date, then group_by and operate row-wise, and mutate a Dumsy column using ifelse with between:
library(dplyr)
library(lubridate)
left_join(y, x, by=c("ID", "GroupID")) %>%
mutate(across(starts_with("Date"), mdy)) %>%
group_by(ID, GroupID) %>%
rowwise() %>%
mutate(Dumsy = ifelse(between(DateStart.x, DateStart.y, DateEnd), 1, NA)) %>%
select(ID, GroupID, DateStart=DateStart.x, Dumsy)
ID GroupID DateStart Dumsy
<int> <int> <date> <dbl>
1 1 45 2021-02-11 1
2 1 65 2021-02-11 NA
3 1 45 2021-02-12 1
4 1 65 2021-02-12 NA
5 1 45 2021-02-13 1
6 1 65 2021-02-13 NA
7 1 45 2021-02-14 NA
8 1 65 2021-02-14 1
9 1 45 2021-02-15 NA
10 1 65 2021-02-15 1
# ... with 32 more rows
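The join-then-compare idea does not strictly need row-wise evaluation, which may become slow on the roughly 2 million rows mentioned in the question. Below is a minimal base R sketch of a vectorized variant. It assumes a cut-down version of the example data; the names `xs`, `ys`, `Date`, `row`, and `to_date` are made up for illustration and are not part of the original answer.

```r
# Cut-down example data (hypothetical subset of the question's x and y)
xs <- data.frame(ID = c(1L, 1L), GroupID = c(45L, 65L),
                 DateStart = c("2/11/2021", "2/14/2021"),
                 DateEnd   = c("2/13/2021", "2/15/2021"))
ys <- data.frame(ID = 1L, GroupID = rep(c(45L, 65L), times = 3),
                 Date = rep(c("2/11/2021", "2/14/2021", "2/16/2021"), each = 2))

# Remember the original row order, because merge() may reorder rows
ys$row <- seq_len(nrow(ys))

# Left join: keep every row of ys and attach the matching range from xs
m <- merge(ys, xs, by = c("ID", "GroupID"), all.x = TRUE)

# Parse the m/d/Y strings into Date objects
to_date <- function(s) as.Date(s, format = "%m/%d/%Y")

# Vectorized, inclusive range check over whole columns (no row-wise loop)
inside <- to_date(m$Date) >= to_date(m$DateStart) &
          to_date(m$Date) <= to_date(m$DateEnd)
m$Dummy <- ifelse(inside, 1, NA)

# Restore the original order and keep the columns of interest
res <- m[order(m$row), c("ID", "GroupID", "Date", "Dummy")]
rownames(res) <- NULL
res
```

Rows of `ys` whose key is missing from `xs` get an `NA` range from the join, so the comparison yields `NA` and the dummy stays `NA` as well.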
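What you need is a left join between y and x, so that every row of y carries the appropriate date cutoffs. After that, create the dummy variable with a simple ifelse(). Here is a solution using dplyr: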
library(tidyverse)
y %>%
# Rename for clarity
rename(date = DateStart) %>%
left_join(x, by = c("ID", "GroupID")) %>%
# Convert all the date columns to dates
mutate(across(c(date, DateStart, DateEnd), ~ as.Date(.x, format = "%m/%d/%Y"))) %>%
mutate(dummy = ifelse(date >= DateStart & date <= DateEnd, 1, NA))
Output (note that I converted your data frames to tibbles):
# A tibble: 42 x 6
ID GroupID date DateStart DateEnd dummy
<int> <int> <date> <date> <date> <dbl>
1 1 45 2021-02-11 2021-02-11 2021-02-13 1
2 1 65 2021-02-11 2021-02-14 2021-02-15 NA
3 1 45 2021-02-12 2021-02-11 2021-02-13 1
4 1 65 2021-02-12 2021-02-14 2021-02-15 NA
5 1 45 2021-02-13 2021-02-11 2021-02-13 1
6 1 65 2021-02-13 2021-02-14 2021-02-15 NA
7 1 45 2021-02-14 2021-02-11 2021-02-13 NA
8 1 65 2021-02-14 2021-02-14 2021-02-15 1
9 1 45 2021-02-15 2021-02-11 2021-02-13 NA
10 1 65 2021-02-15 2021-02-14 2021-02-15 1
# ... with 32 more rows
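One caveat with a join-based approach: if x ever holds more than one range for the same ID and GroupID pair (not the case in the example data), the left join returns one row per combination of y row and range, so rows of y get duplicated. The following base R sketch shows one possible way to collapse back to a single row per y row. The data and the names `x2`, `y2`, `row`, and `any_hit` are invented for illustration.

```r
# Hypothetical data: two ranges for the same ID / GroupID pair
x2 <- data.frame(ID = c(1L, 1L), GroupID = c(45L, 45L),
                 DateStart = as.Date(c("2021-02-11", "2021-02-16")),
                 DateEnd   = as.Date(c("2021-02-13", "2021-02-17")))
y2 <- data.frame(ID = 1L, GroupID = 45L,
                 Date = as.Date(c("2021-02-12", "2021-02-14", "2021-02-16")))
y2$row <- seq_len(nrow(y2))

# The join now yields 3 * 2 = 6 rows instead of 3
m <- merge(y2, x2, by = c("ID", "GroupID"), all.x = TRUE)
hit <- m$Date >= m$DateStart & m$Date <= m$DateEnd

# For each original y2 row: TRUE if the date falls inside at least one range
any_hit <- vapply(split(hit, m$row),
                  function(h) any(h, na.rm = TRUE), logical(1))

# Map the per-row result back onto y2 and build the 1 / NA dummy
y2$Dummy <- ifelse(unname(any_hit[as.character(y2$row)]), 1, NA)
y2$row <- NULL
y2
```

Here the first and third dates each fall inside one of the two ranges, while the second date falls inside neither.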
An alternative (scalable) approach using the lubridate package and a lookup environment:
library(tidyverse)
library(lubridate)
# create a lookup table with date intervals
h <- as.list(interval(mdy(x$DateStart), mdy(x$DateEnd))) %>%
setNames(paste0(x$ID, x$GroupID)) %>%
list2env(hash = TRUE)
# define functions for finding values in hash table
`%||%` <- function(a, b) if(is.null(a)) b else a
lookup <- Vectorize(function(x, y, env) {
y %within% {env[[x]] %||% return(NA)} # is date in range of hash table value?
}, c("x", "y"))
y %>%
mutate(ID_comb = paste0(ID, GroupID),
DateStart = mdy(DateStart),
new = lookup(ID_comb, DateStart, h))
ID GroupID DateStart ID_comb new
1 1 45 2021-02-11 145 TRUE
2 1 65 2021-02-11 165 FALSE
3 1 45 2021-02-12 145 TRUE
4 1 65 2021-02-12 165 FALSE
5 1 45 2021-02-13 145 TRUE
6 1 65 2021-02-13 165 FALSE
7 1 45 2021-02-14 145 FALSE
8 1 65 2021-02-14 165 TRUE
9 1 45 2021-02-15 145 FALSE
10 1 65 2021-02-15 165 TRUE
11 1 45 2021-02-16 145 FALSE
12 1 65 2021-02-16 165 FALSE
13 1 45 2021-02-17 145 FALSE
14 1 65 2021-02-17 165 FALSE
15 2 45 2021-02-11 245 FALSE
16 2 65 2021-02-11 265 FALSE
17 2 45 2021-02-12 245 FALSE
18 2 65 2021-02-12 265 FALSE
19 2 45 2021-02-13 245 FALSE
20 2 65 2021-02-13 265 FALSE
21 2 45 2021-02-14 245 FALSE
22 2 65 2021-02-14 265 FALSE
23 2 45 2021-02-15 245 FALSE
24 2 65 2021-02-15 265 FALSE
25 2 45 2021-02-16 245 FALSE
26 2 65 2021-02-16 265 TRUE
27 2 45 2021-02-17 245 FALSE
28 2 65 2021-02-17 265 TRUE
29 3 45 2021-02-11 345 TRUE
30 3 65 2021-02-11 365 NA
31 3 45 2021-02-12 345 TRUE
32 3 65 2021-02-12 365 NA
33 3 45 2021-02-13 345 TRUE
34 3 65 2021-02-13 365 NA
35 3 45 2021-02-14 345 TRUE
36 3 65 2021-02-14 365 NA
37 3 45 2021-02-15 345 FALSE
38 3 65 2021-02-15 365 NA
39 3 45 2021-02-16 345 FALSE
40 3 65 2021-02-16 365 NA
41 3 45 2021-02-17 345 FALSE
42 3 65 2021-02-17 365 NA
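Note that the combination of ID 3 and GroupID 65 does not exist in the example ranges. I chose to keep the Boolean values to distinguish found matches (TRUE), non-matches (FALSE), and keys that are missing from the lookup table (NA).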
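If the 1/NA dummy from the question is wanted instead of the logical column, one possible conversion is sketched below. The vector `new` is only a stand-in for the column produced above, and mapping both FALSE and NA to NA is an assumption about the desired output.

```r
# Stand-in for the logical lookup result: match, non-match, missing key
new <- c(TRUE, FALSE, NA)

# %in% never returns NA, so FALSE and NA both end up as NA here
Dummy <- ifelse(new %in% TRUE, 1, NA)
Dummy
```

Inside the pipeline above, the same expression could be used in a further mutate() step.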