I have two simulated data frames:
Problem: I want to merge ONLY the rows from d that contain time that falls one month before each visit date for each person to the rows in d2. To get the data to merge properly, I need to merge by both the person and the visit date. In the end, I want a data set that contains only the rows whose start and end dates contain the one month before each visit date, and I want to keep all columns. I need a solution that does not use data.table .
If it worked: From d it would pull,
It would then merge each set of rows to the person and visit date of the same values in d2 (a 'one-to-many' merge).
Tricky part: if person 1's row 4 overlapped w/the 1 month before the visit date contained in person 1's row 5, I'd want to pull that row 4 & attach it to that second visit date (but NOT pull that row 4 & attach it to the first visit date b/c it doesn't contain the month before the first visit date).
Below, I've tried this 4 different ways , and annotated the error message associated with each.
#Load packages
pacman::p_load(dplyr, tidyr, lubridate, sqldf)
#Create data frame 1
#Create variables for data frame 1
person <- c(1, 1, 1, 1, 1, 2, 2, 2)
start <- c('2016-06-17', '2016-10-01', '2017-01-01', '2017-01-15', '2017-06-05', '2014-12-14', '2015-01-01', '2015-01-19')
end <- c('2016-09-30', '2016-12-31', '2017-01-14', '2017-06-04', '2017-09-03', '2014-12-31', '2015-01-18', '2015-07-03')
visit <- c(NA, NA, NA, '2017-01-15', '2017-08-01', NA, NA, '2015-02-22')
row <- c(1, 2, 3, 4, 5, 1, 2, 3)
#Populate data frame 1 with variables
d <- cbind(person, row)
d <- as.data.frame(d)
#Format dates and add to data frame 1
d$start <- as.Date(start, format = '%Y-%m-%d')
d$end <- as.Date(end, format = '%Y-%m-%d')
d$visit <- as.Date(visit, format = '%Y-%m-%d')
######
#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')
#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)
#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')
#Need to merge conditionally such that only rows from d that contain time one month before 'visit' are selected, and then merged by both 'person' and 'visit' to d2 (the 'backbone' data set)
#Attempt 1 yields this error message:
#Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= :
# the condition has length > 1
if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$end)) {
left_join(x = d, y = d2, by = c('person' = 'person_2', 'visit' = 'visit_2'))
}
#> Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= : the condition has length > 1
#Attempt 2 - Error: tinyformat: Too many conversion specifiers in format string
result = sqldf('
select *
from back left join d on
(((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$EndDate))
AND d2.person_2 = d.person
And d2.visit_2 = d.visit
')
#> Error: tinyformat: Too many conversion specifiers in format string
#Attempt 3 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
FROM d2
LEFT JOIN d ON (visit %m-% months(1)) BETWEEN start and end')
#> Error: tinyformat: Too many conversion specifiers in format string
#Attempt 4 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
FROM d2
LEFT JOIN
d on
d2.person_2 = d.person and
d2.visit_2 = d.visit and
(d$visit %m-% months(1) >= d$start) and
(d$visit %m-% months(1) <= d$end)')
#> Error: tinyformat: Too many conversion specifiers in format string
d
#> person row start end visit
#> 1 1 1 2016-06-17 2016-09-30 <NA>
#> 2 1 2 2016-10-01 2016-12-31 <NA>
#> 3 1 3 2017-01-01 2017-01-14 <NA>
#> 4 1 4 2017-01-15 2017-06-04 2017-01-15
#> 5 1 5 2017-06-05 2017-09-03 2017-08-01
#> 6 2 1 2014-12-14 2014-12-31 <NA>
#> 7 2 2 2015-01-01 2015-01-18 <NA>
#> 8 2 3 2015-01-19 2015-07-03 2015-02-22
Created on 2022-05-13 by the reprex package (v2.0.1)
#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')
#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)
#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')
d2
#> person_2 visit_2
#> 1 1 2017-01-15
#> 2 1 2017-08-01
#> 3 2 2015-02-22
Created on 2022-05-13 by the reprex package (v2.0.1)
d2 %>%
mutate(visit_2_m1 = visit_2 %m-% months(1)) %>%
fuzzyjoin::fuzzy_left_join(
d, ., by = c("start" = "visit_2_m1", "end" = "visit_2_m1"),
match_fun = list(`<=`, `>=`))
# person row start end visit person_2 visit_2 visit_2_m1
# 1 1 1 2016-06-17 2016-09-30 <NA> NA <NA> <NA>
# 2 1 2 2016-10-01 2016-12-31 <NA> 1 2017-01-15 2016-12-15
# 3 1 3 2017-01-01 2017-01-14 <NA> NA <NA> <NA>
# 4 1 4 2017-01-15 2017-06-04 2017-01-15 NA <NA> <NA>
# 5 1 5 2017-06-05 2017-09-03 2017-08-01 1 2017-08-01 2017-07-01
# 6 2 1 2014-12-14 2014-12-31 <NA> NA <NA> <NA>
# 7 2 2 2015-01-01 2015-01-18 <NA> NA <NA> <NA>
# 8 2 3 2015-01-19 2015-07-03 2015-02-22 2 2015-02-22 2015-01-22
The description of your problem is unclear and I'm not sure I've understood exactly what you're trying to achieve. How many rows should be returned in the example above? Do you just want to join a subset of the rows in d to d2?
Here is a solution using dplyr only which may be what you are looking for:
d %>%
mutate(
date_interval = start %--% end,
visit_within = visit %m-% months(1) %within% date_interval
) %>%
filter(visit_within == TRUE) %>%
left_join(d2, ., by = c("person_2" = "person"))
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.