简体   繁体   中英

How to left join two data frames conditionally - by rows that fall within a date range - and by two variables found in each data frame

I have two simulated data frames:

  1. d, created below, which has all the rows of longitudinal data for two different people. Each row has a start and end date. Some rows have a visit date associated with them because the person had a visit during that time frame.
  2. d2, created below (printed in second block of code), which has only one row per visit date per person (this is the 'backbone' or 'finder file' data frame) onto which I want to merge new rows.

Problem: I want to merge ONLY the rows from d that contain time that falls one month before each visit date for each person to the rows in d2. To get the data to merge properly, I need to merge by both the person and the visit date. In the end, I want a data set that contains only the rows whose start and end dates contain the one month before each visit date, and I want to keep all columns. I need a solution that does not use data.table .

If it worked: From d it would pull,

  • person 1's rows 2 and 3 (b/c they contain the 1 month before person 1's first visit date)
  • person 1's row 5 (b/c it contains the 1 month before person 1's second visit date)
  • person 2's row 3 (b/c it contains the 1 month before person 2's visit date)

It would then merge each set of rows to the person and visit date of the same values in d2 (a 'one-to-many' merge).

Tricky part: if person 1's row 4 overlapped w/the 1 month before the visit date contained in person 1's row 5, I'd want to pull that row 4 & attach it to that second visit date (but NOT pull that row 4 & attach it to the first visit date b/c it doesn't contain the month before the first visit date).

Below, I've tried this 4 different ways , and annotated the error message associated with each.

#Load packages
pacman::p_load(dplyr, tidyr, lubridate, sqldf)

#Create data frame 1
#Create variables for data frame 1 
person <- c(1, 1, 1, 1, 1, 2, 2, 2)
start <- c('2016-06-17', '2016-10-01', '2017-01-01', '2017-01-15', '2017-06-05', '2014-12-14', '2015-01-01', '2015-01-19')
end <- c('2016-09-30', '2016-12-31', '2017-01-14', '2017-06-04', '2017-09-03', '2014-12-31', '2015-01-18', '2015-07-03')
visit <- c(NA, NA, NA, '2017-01-15', '2017-08-01', NA, NA, '2015-02-22')
row <- c(1, 2, 3, 4, 5, 1, 2, 3)

#Populate data frame 1 with variables
d <- cbind(person, row)
d <- as.data.frame(d)

#Format dates and add to data frame 1
d$start <- as.Date(start, format = '%Y-%m-%d')
d$end <- as.Date(end, format = '%Y-%m-%d')
d$visit <- as.Date(visit, format = '%Y-%m-%d')

######

#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')

#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)

#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')

#Need to merge conditionally such that only rows from d that contain time one month before 'visit' are selected, and then merged by both 'person' and 'visit' to d2 (the 'backbone' data set)
#Attempt 1 yields this error message:
#Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <=  : 
 # the condition has length > 1
if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$end)) {
  left_join(x = d, y = d2, by = c('person' = 'person_2', 'visit' = 'visit_2'))
}
#> Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= : the condition has length > 1

#Attempt 2 - Error: tinyformat: Too many conversion specifiers in format string
result = sqldf('
  select *
  from back left join d on 
  (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$EndDate))
  AND d2.person_2 = d.person
  And d2.visit_2 = d.visit
  ')
#> Error: tinyformat: Too many conversion specifiers in format string

#Attempt 3 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
      FROM d2
      LEFT JOIN d ON (visit %m-% months(1)) BETWEEN start and end')
#> Error: tinyformat: Too many conversion specifiers in format string

#Attempt 4 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
  FROM d2 
  LEFT JOIN
  d on 
      d2.person_2 = d.person and
      d2.visit_2 = d.visit and 
      (d$visit %m-% months(1) >= d$start) and 
      (d$visit %m-% months(1) <= d$end)')
#> Error: tinyformat: Too many conversion specifiers in format string
      
d
#>   person row      start        end      visit
#> 1      1   1 2016-06-17 2016-09-30       <NA>
#> 2      1   2 2016-10-01 2016-12-31       <NA>
#> 3      1   3 2017-01-01 2017-01-14       <NA>
#> 4      1   4 2017-01-15 2017-06-04 2017-01-15
#> 5      1   5 2017-06-05 2017-09-03 2017-08-01
#> 6      2   1 2014-12-14 2014-12-31       <NA>
#> 7      2   2 2015-01-01 2015-01-18       <NA>
#> 8      2   3 2015-01-19 2015-07-03 2015-02-22

Created on 2022-05-13 by the reprex package (v2.0.1)

#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')

#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)

#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')
d2
#>   person_2    visit_2
#> 1        1 2017-01-15
#> 2        1 2017-08-01
#> 3        2 2015-02-22

Created on 2022-05-13 by the reprex package (v2.0.1)

d2 %>%
  mutate(visit_2_m1 = visit_2 %m-% months(1)) %>%
  fuzzyjoin::fuzzy_left_join(
    d, ., by = c("start" = "visit_2_m1", "end" = "visit_2_m1"),
    match_fun = list(`<=`, `>=`))
#   person row      start        end      visit person_2    visit_2 visit_2_m1
# 1      1   1 2016-06-17 2016-09-30       <NA>       NA       <NA>       <NA>
# 2      1   2 2016-10-01 2016-12-31       <NA>        1 2017-01-15 2016-12-15
# 3      1   3 2017-01-01 2017-01-14       <NA>       NA       <NA>       <NA>
# 4      1   4 2017-01-15 2017-06-04 2017-01-15       NA       <NA>       <NA>
# 5      1   5 2017-06-05 2017-09-03 2017-08-01        1 2017-08-01 2017-07-01
# 6      2   1 2014-12-14 2014-12-31       <NA>       NA       <NA>       <NA>
# 7      2   2 2015-01-01 2015-01-18       <NA>       NA       <NA>       <NA>
# 8      2   3 2015-01-19 2015-07-03 2015-02-22        2 2015-02-22 2015-01-22

The description of your problem is unclear and I'm not sure I've understood exactly what you're trying to achieve. How many rows should be returned in the example above? Do you just want to join a subset of the rows in d to d2?

Here is a solution using dplyr only which may be what you are looking for:

  d %>% 
  mutate(
    date_interval = start %--% end,
    visit_within = visit %m-% months(1) %within% date_interval
  ) %>% 
  filter(visit_within == TRUE) %>% 
  left_join(d2, ., by = c("person_2" = "person"))

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM