
How do I find if a date in the first DF falls within the range of dates in another data frame?

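This is a bit tricky. I have two data frames and need help creating a function or some kind of loop to determine whether a date in data.frame y falls between the values of the two date columns in data.frame x.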

So, for example:

Data frame x:

x <- structure(list(ID = c(1L, 1L, 3L, 2L, 2L), GroupID = c(45L, 65L, 45L, 65L, 45L), DateStart = c("2/11/2021", 
"2/14/2021", "2/10/2021", "2/16/2021", "2/19/2021"), DateEnd = c("2/13/2021", 
"2/15/2021", "2/14/2021", "2/18/2021", "2/22/2021")), 
class = "data.frame", row.names = c(NA, -5L))

x

  ID GroupID DateStart   DateEnd
1  1      45 2/11/2021 2/13/2021
2  1      65 2/14/2021 2/15/2021
3  3      45 2/10/2021 2/14/2021
4  2      65 2/16/2021 2/18/2021
5  2      45 2/19/2021 2/22/2021

And then data frame y:

y <- structure(list(ID = c(1L, 1L, 1L, 1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,1L,
2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L,2L, 
3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L,3L), GroupID = c(45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,
45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,
45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L,45L,65L), 
DateStart = c("2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021", 
"2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021",
"2/11/2021","2/11/2021","2/12/2021","2/12/2021","2/13/2021","2/13/2021","2/14/2021",
"2/14/2021","2/15/2021","2/15/2021","2/16/2021","2/16/2021","2/17/2021","2/17/2021")), 
class = "data.frame", row.names = c(NA, -42L))

y

   ID GroupID DateStart
1   1      45 2/11/2021
2   1      65 2/11/2021
3   1      45 2/12/2021
4   1      65 2/12/2021
5   1      45 2/13/2021
6   1      65 2/13/2021
7   1      45 2/14/2021
8   1      65 2/14/2021
9   1      45 2/15/2021
10  1      65 2/15/2021
11  1      45 2/16/2021
12  1      65 2/16/2021
13  1      45 2/17/2021
14  1      65 2/17/2021
15  2      45 2/11/2021
16  2      65 2/11/2021
17  2      45 2/12/2021
18  2      65 2/12/2021
19  2      45 2/13/2021
20  2      65 2/13/2021
21  2      45 2/14/2021
22  2      65 2/14/2021
23  2      45 2/15/2021
24  2      65 2/15/2021
25  2      45 2/16/2021
26  2      65 2/16/2021
27  2      45 2/17/2021
28  2      65 2/17/2021
29  3      45 2/11/2021
30  3      65 2/11/2021
31  3      45 2/12/2021
32  3      65 2/12/2021
33  3      45 2/13/2021
34  3      65 2/13/2021
35  3      45 2/14/2021
36  3      65 2/14/2021
37  3      45 2/15/2021
38  3      65 2/15/2021
39  3      45 2/16/2021
40  3      65 2/16/2021
41  3      45 2/17/2021
42  3      65 2/17/2021

What I would like to end up with:

y
   ID GroupID DateStart Dummy
1   1      45 2/11/2021     1
2   1      65 2/11/2021    NA
3   1      45 2/12/2021     1
4   1      65 2/12/2021    NA
5   1      45 2/13/2021     1
6   1      65 2/13/2021    NA
7   1      45 2/14/2021    NA
8   1      65 2/14/2021     1
9   1      45 2/15/2021    NA
10  1      65 2/15/2021     1
11  1      45 2/16/2021    NA
12  1      65 2/16/2021    NA
13  1      45 2/17/2021    NA
14  1      65 2/17/2021    NA
15  2      45 2/11/2021    NA
16  2      65 2/11/2021    NA
17  2      45 2/12/2021    NA
18  2      65 2/12/2021    NA
19  2      45 2/13/2021    NA
20  2      65 2/13/2021    NA
21  2      45 2/14/2021    NA
22  2      65 2/14/2021    NA
23  2      45 2/15/2021    NA
24  2      65 2/15/2021    NA
25  2      45 2/16/2021    NA
26  2      65 2/16/2021     1
27  2      45 2/17/2021    NA
28  2      65 2/17/2021     1
29  3      45 2/11/2021     1
30  3      65 2/11/2021    NA
31  3      45 2/12/2021     1
32  3      65 2/12/2021    NA
33  3      45 2/13/2021     1
34  3      65 2/13/2021    NA
35  3      45 2/14/2021     1
36  3      65 2/14/2021    NA
37  3      45 2/15/2021    NA
38  3      65 2/15/2021    NA
39  3      45 2/16/2021    NA
40  3      65 2/16/2021    NA
41  3      45 2/17/2021    NA
42  3      65 2/17/2021    NA

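The resulting y data.frame has a new fourth column, Dummy, which holds a 1 if the date falls between DateStart and DateEnd in data.frame x, matched on GroupID and ID. For example, for every row in y with ID=3 and GroupID=45 whose date is between 2/10/2021 and 2/14/2021 (inclusive), I want the loop or function to put a 1 in the dummy column. For dates in y that do not satisfy the conditions in x, I want an NA.

For each row in y, I basically need to read through every row in x and, if the date falls within the specified range for that GroupID and ID, get a 1.

My real dataset is very large (about 2 million observations), so I am also looking for a fast way to do this.

I tried this: R: Check if value from dataframe is within range other dataframe

But no dice.

Thanks in advance!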

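  1. left_join y with x on ID and GroupID
  2. Use the mdy function to convert the date columns to class Date
  3. group_by and operate row by row
  4. Use mutate with ifelse and between to create the Dumsy column:
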
library(dplyr)
library(lubridate)

left_join(y, x, by=c("ID", "GroupID")) %>% 
  mutate(across(starts_with("Date"), mdy)) %>% 
  group_by(ID, GroupID) %>% 
  rowwise() %>% 
  mutate(Dumsy = ifelse(between(DateStart.x, DateStart.y, DateEnd), 1, NA)) %>% 
  select(ID, GroupID, DateStart=DateStart.x, Dumsy)
     ID GroupID DateStart  Dumsy
   <int>   <int> <date>     <dbl>
 1     1      45 2021-02-11     1
 2     1      65 2021-02-11    NA
 3     1      45 2021-02-12     1
 4     1      65 2021-02-12    NA
 5     1      45 2021-02-13     1
 6     1      65 2021-02-13    NA
 7     1      45 2021-02-14    NA
 8     1      65 2021-02-14     1
 9     1      45 2021-02-15    NA
10     1      65 2021-02-15     1
# ... with 32 more rows

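What you need is a left join between y and x, so that every row of y carries the appropriate date cutoffs. After that, create the dummy variable with a simple ifelse(). Here is a solution using dplyr: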

library(tidyverse)
y %>% 
  # Rename for clarity
  rename(date = DateStart) %>% 
  left_join(x, by = c("ID", "GroupID")) %>% 
  # Convert all the date columns to dates
  mutate(across(c(date, DateStart, DateEnd), ~ as.Date(.x, format = "%m/%d/%Y"))) %>% 
  mutate(dummy = ifelse(date >= DateStart & date <= DateEnd, 1, NA))

Output (note that I converted your data frames to tibbles):

# A tibble: 42 x 6
      ID GroupID date       DateStart  DateEnd    dummy
   <int>   <int> <date>     <date>     <date>     <dbl>
 1     1      45 2021-02-11 2021-02-11 2021-02-13     1
 2     1      65 2021-02-11 2021-02-14 2021-02-15    NA
 3     1      45 2021-02-12 2021-02-11 2021-02-13     1
 4     1      65 2021-02-12 2021-02-14 2021-02-15    NA
 5     1      45 2021-02-13 2021-02-11 2021-02-13     1
 6     1      65 2021-02-13 2021-02-14 2021-02-15    NA
 7     1      45 2021-02-14 2021-02-11 2021-02-13    NA
 8     1      65 2021-02-14 2021-02-14 2021-02-15     1
 9     1      45 2021-02-15 2021-02-11 2021-02-13    NA
10     1      65 2021-02-15 2021-02-14 2021-02-15     1
# ... with 32 more rows
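
One caveat with the join-based approaches above: if x ever holds more than one range for the same ID and GroupID pair, the join returns one row per range, so the result has more rows than y. The following is a minimal base R sketch of one way to keep exactly one row per row of y. The toy data, the helper `to_date`, and the names `y_keyed`, `x_rng`, and `hit_rows` are made up for illustration and are not part of the original answers.

```r
# Hypothetical toy data: ID 1 / GroupID 45 has two ranges in x.
x <- data.frame(
  ID        = c(1L, 1L, 2L),
  GroupID   = c(45L, 45L, 65L),
  DateStart = c("2/11/2021", "2/15/2021", "2/16/2021"),
  DateEnd   = c("2/12/2021", "2/16/2021", "2/18/2021")
)
y <- data.frame(
  ID        = c(1L, 1L, 1L, 2L, 3L),
  GroupID   = c(45L, 45L, 45L, 65L, 45L),
  DateStart = c("2/11/2021", "2/13/2021", "2/16/2021", "2/17/2021", "2/11/2021")
)

# Parse month/day/year strings into Date objects.
to_date <- function(s) as.Date(s, format = "%m/%d/%Y")

# Tag each row of y with its position so matches can be mapped back.
y_keyed <- data.frame(row_id = seq_len(nrow(y)), ID = y$ID,
                      GroupID = y$GroupID, date = to_date(y$DateStart))
x_rng <- data.frame(ID = x$ID, GroupID = x$GroupID,
                    start = to_date(x$DateStart), end = to_date(x$DateEnd))

# Inner join on the keys: one row per (y row, candidate range) pair.
m <- merge(y_keyed, x_rng, by = c("ID", "GroupID"))

# A y row is a hit if its date lies inside at least one of its ranges.
hit_rows <- unique(m$row_id[m$date >= m$start & m$date <= m$end])

# 1 for hits, NA otherwise; y keeps its original row count and order.
y$Dummy <- ifelse(seq_len(nrow(y)) %in% hit_rows, 1, NA)
y
```

Here the three ID 1 / GroupID 45 rows are each checked against both ranges, and the ID 3 row, which has no range in x at all, ends up as NA.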

An alternative (scalable) approach using the lubridate package and a lookup environment:

library(tidyverse)
library(lubridate)
# create a lookup table with date intervals
h <- as.list(interval(mdy(x$DateStart), mdy(x$DateEnd))) %>%
  setNames(paste0(x$ID, x$GroupID)) %>%
  list2env(hash = T)

# define functions for finding values in hash table
`%||%` <- function(a, b) if(is.null(a)) b else a
lookup <- Vectorize(function(x, y, env) {
  y %within% {env[[x]] %||% return(NA)} # is date in range of hash table value?
}, c("x", "y"))

y %>%
  mutate(ID_comb = paste0(ID, GroupID),
         DateStart = mdy(DateStart),
         new = lookup(ID_comb, DateStart, h))

   ID GroupID  DateStart ID_comb   new
1   1      45 2021-02-11     145  TRUE
2   1      65 2021-02-11     165 FALSE
3   1      45 2021-02-12     145  TRUE
4   1      65 2021-02-12     165 FALSE
5   1      45 2021-02-13     145  TRUE
6   1      65 2021-02-13     165 FALSE
7   1      45 2021-02-14     145 FALSE
8   1      65 2021-02-14     165  TRUE
9   1      45 2021-02-15     145 FALSE
10  1      65 2021-02-15     165  TRUE
11  1      45 2021-02-16     145 FALSE
12  1      65 2021-02-16     165 FALSE
13  1      45 2021-02-17     145 FALSE
14  1      65 2021-02-17     165 FALSE
15  2      45 2021-02-11     245 FALSE
16  2      65 2021-02-11     265 FALSE
17  2      45 2021-02-12     245 FALSE
18  2      65 2021-02-12     265 FALSE
19  2      45 2021-02-13     245 FALSE
20  2      65 2021-02-13     265 FALSE
21  2      45 2021-02-14     245 FALSE
22  2      65 2021-02-14     265 FALSE
23  2      45 2021-02-15     245 FALSE
24  2      65 2021-02-15     265 FALSE
25  2      45 2021-02-16     245 FALSE
26  2      65 2021-02-16     265  TRUE
27  2      45 2021-02-17     245 FALSE
28  2      65 2021-02-17     265  TRUE
29  3      45 2021-02-11     345  TRUE
30  3      65 2021-02-11     365    NA
31  3      45 2021-02-12     345  TRUE
32  3      65 2021-02-12     365    NA
33  3      45 2021-02-13     345  TRUE
34  3      65 2021-02-13     365    NA
35  3      45 2021-02-14     345  TRUE
36  3      65 2021-02-14     365    NA
37  3      45 2021-02-15     345 FALSE
38  3      65 2021-02-15     365    NA
39  3      45 2021-02-16     345 FALSE
40  3      65 2021-02-16     365    NA
41  3      45 2021-02-17     345 FALSE
42  3      65 2021-02-17     365    NA

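Note that the combination of ID 3 and GroupID 65 does not exist in the example data for x. I chose to keep Boolean values so the result distinguishes matches (TRUE), non-matches (FALSE), and keys that are missing from the lookup table (NA).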
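
Since the question mentions roughly 2 million rows, a non-equi join may also be worth trying. The sketch below uses the data.table package, which none of the answers above use, so treat it as a different technique rather than a variant of them. The toy data and the helper columns `start`, `end`, and `date` are made up for illustration.

```r
library(data.table)

# Hypothetical toy data, smaller than the question's x and y.
x <- data.table(
  ID        = c(1L, 1L, 2L),
  GroupID   = c(45L, 45L, 65L),
  DateStart = c("2/11/2021", "2/15/2021", "2/16/2021"),
  DateEnd   = c("2/12/2021", "2/16/2021", "2/18/2021")
)
y <- data.table(
  ID        = c(1L, 1L, 1L, 2L, 3L),
  GroupID   = c(45L, 45L, 45L, 65L, 45L),
  DateStart = c("2/11/2021", "2/13/2021", "2/16/2021", "2/17/2021", "2/11/2021")
)

# Parse the month/day/year strings into Date columns used only for the join.
x[, `:=`(start = as.Date(DateStart, format = "%m/%d/%Y"),
         end   = as.Date(DateEnd,   format = "%m/%d/%Y"))]
y[, date := as.Date(DateStart, format = "%m/%d/%Y")]

# Non-equi update join: every row of y whose date lies inside a matching
# range in x gets Dummy = 1; rows without a match are left as NA.
y[x, on = .(ID, GroupID, date >= start, date <= end), Dummy := 1]

# Drop the helper column so y keeps its original layout plus Dummy.
y[, date := NULL]
y
```

Because the assignment happens by reference inside the join, y is never expanded to one row per range, which tends to matter at this data size. Whether it is actually faster than the dplyr solutions on the real data would need to be measured.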
