How to left join two data frames conditionally - by rows that fall within a date range - and by two variables found in each data frame

I have two simulated data frames:

  1. d, created below, which has all the rows of longitudinal data for two different people. Each row has a start and end date. Some rows have a visit date associated with them because the person had a visit during that time frame.
  2. d2, created below (printed in the second block of code), which has only one row per visit date per person. This is the 'backbone' or 'finder file' data frame onto which I want to merge new rows.

Problem: I want to merge ONLY the rows from d that contain time falling within the one month before each visit date, for each person, to the rows in d2. To get the data to merge properly, I need to merge by both the person and the visit date. In the end, I want a data set that contains only the rows whose start and end dates contain the one month before each visit date, and I want to keep all columns. I need a solution that does not use data.table.

If it worked, from d it would pull:

  • person 1's rows 2 and 3 (because they contain the 1 month before person 1's first visit date)
  • person 1's row 5 (because it contains the 1 month before person 1's second visit date)
  • person 2's row 3 (because it contains the 1 month before person 2's visit date)

It would then merge each set of rows to the matching person and visit date in d2 (a 'one-to-many' merge).

Tricky part: if person 1's row 4 overlapped with the 1 month before the visit date contained in person 1's row 5, I'd want to pull row 4 and attach it to that second visit date (but NOT attach it to the first visit date, because row 4 doesn't contain the month before the first visit date).

Below, I've tried this four different ways and annotated the error message associated with each.

#Load packages
pacman::p_load(dplyr, tidyr, lubridate, sqldf)

#Create data frame 1
#Create variables for data frame 1 
person <- c(1, 1, 1, 1, 1, 2, 2, 2)
start <- c('2016-06-17', '2016-10-01', '2017-01-01', '2017-01-15', '2017-06-05', '2014-12-14', '2015-01-01', '2015-01-19')
end <- c('2016-09-30', '2016-12-31', '2017-01-14', '2017-06-04', '2017-09-03', '2014-12-31', '2015-01-18', '2015-07-03')
visit <- c(NA, NA, NA, '2017-01-15', '2017-08-01', NA, NA, '2015-02-22')
row <- c(1, 2, 3, 4, 5, 1, 2, 3)

#Populate data frame 1 with variables
d <- cbind(person, row)
d <- as.data.frame(d)

#Format dates and add to data frame 1
d$start <- as.Date(start, format = '%Y-%m-%d')
d$end <- as.Date(end, format = '%Y-%m-%d')
d$visit <- as.Date(visit, format = '%Y-%m-%d')

######

#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')

#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)

#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')

#Need to merge conditionally such that only rows from d that contain time one month before 'visit' are selected, and then merged by both 'person' and 'visit' to d2 (the 'backbone' data set)
#Attempt 1 yields this error message:
#Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <=  : 
 # the condition has length > 1
if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$end)) {
  left_join(x = d, y = d2, by = c('person' = 'person_2', 'visit' = 'visit_2'))
}
#> Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= : the condition has length > 1

#Attempt 2 - Error: tinyformat: Too many conversion specifiers in format string
result = sqldf('
  select *
  from back left join d on 
  (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$EndDate))
  AND d2.person_2 = d.person
  And d2.visit_2 = d.visit
  ')
#> Error: tinyformat: Too many conversion specifiers in format string

#Attempt 3 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
      FROM d2
      LEFT JOIN d ON (visit %m-% months(1)) BETWEEN start and end')
#> Error: tinyformat: Too many conversion specifiers in format string

#Attempt 4 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
  FROM d2 
  LEFT JOIN
  d on 
      d2.person_2 = d.person and
      d2.visit_2 = d.visit and 
      (d$visit %m-% months(1) >= d$start) and 
      (d$visit %m-% months(1) <= d$end)')
#> Error: tinyformat: Too many conversion specifiers in format string
      
d
#>   person row      start        end      visit
#> 1      1   1 2016-06-17 2016-09-30       <NA>
#> 2      1   2 2016-10-01 2016-12-31       <NA>
#> 3      1   3 2017-01-01 2017-01-14       <NA>
#> 4      1   4 2017-01-15 2017-06-04 2017-01-15
#> 5      1   5 2017-06-05 2017-09-03 2017-08-01
#> 6      2   1 2014-12-14 2014-12-31       <NA>
#> 7      2   2 2015-01-01 2015-01-18       <NA>
#> 8      2   3 2015-01-19 2015-07-03 2015-02-22

Created on 2022-05-13 by the reprex package (v2.0.1)

#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')

#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)

#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')
d2
#>   person_2    visit_2
#> 1        1 2017-01-15
#> 2        1 2017-08-01
#> 3        2 2015-02-22

Created on 2022-05-13 by the reprex package (v2.0.1)

d2 %>%
  mutate(visit_2_m1 = visit_2 %m-% months(1)) %>%
  fuzzyjoin::fuzzy_left_join(
    d, ., by = c("start" = "visit_2_m1", "end" = "visit_2_m1"),
    match_fun = list(`<=`, `>=`))
#   person row      start        end      visit person_2    visit_2 visit_2_m1
# 1      1   1 2016-06-17 2016-09-30       <NA>       NA       <NA>       <NA>
# 2      1   2 2016-10-01 2016-12-31       <NA>        1 2017-01-15 2016-12-15
# 3      1   3 2017-01-01 2017-01-14       <NA>       NA       <NA>       <NA>
# 4      1   4 2017-01-15 2017-06-04 2017-01-15       NA       <NA>       <NA>
# 5      1   5 2017-06-05 2017-09-03 2017-08-01        1 2017-08-01 2017-07-01
# 6      2   1 2014-12-14 2014-12-31       <NA>       NA       <NA>       <NA>
# 7      2   2 2015-01-01 2015-01-18       <NA>       NA       <NA>       <NA>
# 8      2   3 2015-01-19 2015-07-03 2015-02-22        2 2015-02-22 2015-01-22
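
The join above matches on dates only, so a visit from one person could in principle attach to another person's row if the dates happened to line up. Below is a minimal sketch that also requires the person to match. It assumes fuzzyjoin accepts one comparison function per `by` pair, as its documentation describes; the names d2_m1 and joined are introduced here for illustration, and the data frames are rebuilt with data.frame() so that both person columns are numeric.

```r
library(dplyr)
library(lubridate)
library(fuzzyjoin)

# Same values as in the question, rebuilt so person / person_2 are both numeric
d <- data.frame(
  person = c(1, 1, 1, 1, 1, 2, 2, 2),
  row    = c(1, 2, 3, 4, 5, 1, 2, 3),
  start  = as.Date(c("2016-06-17", "2016-10-01", "2017-01-01", "2017-01-15",
                     "2017-06-05", "2014-12-14", "2015-01-01", "2015-01-19")),
  end    = as.Date(c("2016-09-30", "2016-12-31", "2017-01-14", "2017-06-04",
                     "2017-09-03", "2014-12-31", "2015-01-18", "2015-07-03")),
  visit  = as.Date(c(NA, NA, NA, "2017-01-15", "2017-08-01", NA, NA, "2015-02-22"))
)
d2 <- data.frame(
  person_2 = c(1, 1, 2),
  visit_2  = as.Date(c("2017-01-15", "2017-08-01", "2015-02-22"))
)

# Date exactly one month before each visit
d2_m1 <- d2 %>% mutate(visit_2_m1 = visit_2 %m-% months(1))

# Keep every row of d; attach a visit only when the person matches AND
# start <= (visit - 1 month) <= end
joined <- fuzzy_left_join(
  d, d2_m1,
  by = c("person" = "person_2", "start" = "visit_2_m1", "end" = "visit_2_m1"),
  match_fun = list(`==`, `<=`, `>=`)
)
joined
```

For the example data this should attach the same three visits as the output above, since no visit window happens to fall inside another person's rows.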

The description of your problem is unclear and I'm not sure I've understood exactly what you're trying to achieve. How many rows should be returned in the example above? Do you just want to join a subset of the rows in d to d2?

Here is a solution using only dplyr and lubridate, which may be what you are looking for:

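d %>% 
  mutate(
    date_interval = start %--% end,
    # TRUE when the date one month before the visit lies inside [start, end]
    visit_within = (visit %m-% months(1)) %within% date_interval
  ) %>% 
  filter(visit_within) %>% 
  # cbind() in the question left person_2 as character; convert it to match the numeric person column
  left_join(mutate(d2, person_2 = as.numeric(person_2)), ., by = c("person_2" = "person"))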
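
If the goal is instead the one-to-many result described in the question (every row of d whose start/end range overlaps the month before a visit, attached to that person's visit), one possible sketch follows. It assumes the window is the visit date minus one month up to, but not including, the visit date itself; that reading is an assumption chosen because it reproduces the rows listed in the question. The names d2_window, window_start, matches and result are introduced here for illustration, and the data are rebuilt with data.frame() so both person columns are numeric.

```r
library(dplyr)
library(lubridate)

# Same values as in the question, rebuilt so person / person_2 are both numeric
d <- data.frame(
  person = c(1, 1, 1, 1, 1, 2, 2, 2),
  row    = c(1, 2, 3, 4, 5, 1, 2, 3),
  start  = as.Date(c("2016-06-17", "2016-10-01", "2017-01-01", "2017-01-15",
                     "2017-06-05", "2014-12-14", "2015-01-01", "2015-01-19")),
  end    = as.Date(c("2016-09-30", "2016-12-31", "2017-01-14", "2017-06-04",
                     "2017-09-03", "2014-12-31", "2015-01-18", "2015-07-03")),
  visit  = as.Date(c(NA, NA, NA, "2017-01-15", "2017-08-01", NA, NA, "2015-02-22"))
)
d2 <- data.frame(
  person_2 = c(1, 1, 2),
  visit_2  = as.Date(c("2017-01-15", "2017-08-01", "2015-02-22"))
)

# Assumed look-back window: [visit_2 - 1 month, visit_2), visit date excluded
d2_window <- d2 %>% mutate(window_start = visit_2 %m-% months(1))

# Pair each visit with every row of the same person (base merge),
# then keep only the rows whose [start, end] overlaps the window
matches <- merge(d2_window, d, by.x = "person_2", by.y = "person") %>%
  filter(start < visit_2, end >= window_start) %>%
  arrange(person_2, visit_2, row)

# Left join back onto the backbone so visits without any match are kept as NA rows
result <- left_join(d2, matches, by = c("person_2", "visit_2"))
result
```

On the example data this keeps person 1's rows 2 and 3 for the first visit, person 1's row 5 for the second visit, and person 2's row 3. With dplyr 1.1.0 or later, the same overlap condition can probably be written directly inside the join with join_by(), which would avoid building the intermediate per-person pairing.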
