How to left join two data frames conditionally - by rows that fall within a date range - and by two variables found in each data frame
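I have two simulated data frames:
Problem: I want to merge only the rows in d that contain the time one month before each person's visit date onto the rows in d2. For the data to merge correctly, I need to merge by both person and visit date. In the end, I want a data set that contains only the rows whose start and end dates contain the month before each visit date, and I want to keep all columns. I need a solution that does not use data.table.
If it worked, it would pull from d,
and then merge each group of rows onto the rows of d2 with the same person and visit date (a "one-to-many" merge).
The tricky part: if row 4 for person 1 overlaps the month before the visit date contained in row 5 for person 1, I want to pull row 4 and attach it to the second visit date (but not pull row 4 and attach it to the first visit date, because it does not contain the month before the first visit date).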
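To make the "one month before each visit" condition concrete, here is a minimal sketch using lubridate's `%m-%` operator. The variable names (`visit_date`, `period_start`, `period_end`, `month_before`) are made up for this illustration, and reading "contains" as "the day one month earlier falls inside the period" is only one possible interpretation of the requirement.

```r
library(lubridate)

# One calendar month before a visit date
visit_date <- as.Date("2017-08-01")
month_before <- visit_date %m-% months(1)
month_before
#> [1] "2017-07-01"

# %m-% rolls back to the last valid day of the month instead of returning NA
as.Date("2017-03-31") %m-% months(1)
#> [1] "2017-02-28"

# A period [period_start, period_end] contains that day when both comparisons hold
period_start <- as.Date("2017-06-05")
period_end   <- as.Date("2017-09-03")
contains_month_before <- period_start <= month_before & month_before <= period_end
contains_month_before
#> [1] TRUE
```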
Below are the 4 different approaches I tried, with the error message from each one added as a comment.
#Load packages
pacman::p_load(dplyr, tidyr, lubridate, sqldf)
#Create data frame 1
#Create variables for data frame 1
person <- c(1, 1, 1, 1, 1, 2, 2, 2)
start <- c('2016-06-17', '2016-10-01', '2017-01-01', '2017-01-15', '2017-06-05', '2014-12-14', '2015-01-01', '2015-01-19')
end <- c('2016-09-30', '2016-12-31', '2017-01-14', '2017-06-04', '2017-09-03', '2014-12-31', '2015-01-18', '2015-07-03')
visit <- c(NA, NA, NA, '2017-01-15', '2017-08-01', NA, NA, '2015-02-22')
row <- c(1, 2, 3, 4, 5, 1, 2, 3)
#Populate data frame 1 with variables
d <- cbind(person, row)
d <- as.data.frame(d)
#Format dates and add to data frame 1
d$start <- as.Date(start, format = '%Y-%m-%d')
d$end <- as.Date(end, format = '%Y-%m-%d')
d$visit <- as.Date(visit, format = '%Y-%m-%d')
######
#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')
#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)
#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')
#Need to merge conditionally such that only rows from d that contain time one month before 'visit' are selected, and then merged by both 'person' and 'visit' to d2 (the 'backbone' data set)
#Attempt 1 yields this error message:
#Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= :
# the condition has length > 1
if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$end)) {
left_join(x = d, y = d2, by = c('person' = 'person_2', 'visit' = 'visit_2'))
}
#> Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= : the condition has length > 1
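# Note on Attempt 1: `if` expects a single TRUE/FALSE, but the comparison above returns one value per row of d (and since R 4.2.0 a longer condition is an error rather than a warning). A minimal sketch of a vectorised alternative follows. It rebuilds d so that it runs on its own, and `month_before`, `keep` and `d_kept` are names introduced only for this illustration.

```r
library(lubridate)

# Same layout as d above, rebuilt so the block is self-contained
d <- data.frame(
  person = c(1, 1, 1, 1, 1, 2, 2, 2),
  row    = c(1, 2, 3, 4, 5, 1, 2, 3),
  start  = as.Date(c("2016-06-17", "2016-10-01", "2017-01-01", "2017-01-15",
                     "2017-06-05", "2014-12-14", "2015-01-01", "2015-01-19")),
  end    = as.Date(c("2016-09-30", "2016-12-31", "2017-01-14", "2017-06-04",
                     "2017-09-03", "2014-12-31", "2015-01-18", "2015-07-03")),
  visit  = as.Date(c(NA, NA, NA, "2017-01-15", "2017-08-01",
                     NA, NA, "2015-02-22"))
)

# Vectorised test: one TRUE/FALSE/NA per row instead of a single value
month_before <- d$visit %m-% months(1)
keep <- month_before >= d$start & month_before <= d$end

# which() drops NA, so rows without a visit date are excluded
d_kept <- d[which(keep), ]
d_kept[, c("person", "row", "visit")]
```

This only tests each row against its own `visit` value, so it does not yet attach rows from other periods to a visit. That part needs a join.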
#Attempt 2 - Error: tinyformat: Too many conversion specifiers in format string
result = sqldf('
select *
from back left join d on
(((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$EndDate))
AND d2.person_2 = d.person
And d2.visit_2 = d.visit
')
#> Error: tinyformat: Too many conversion specifiers in format string
#Attempt 3 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
FROM d2
LEFT JOIN d ON (visit %m-% months(1)) BETWEEN start and end')
#> Error: tinyformat: Too many conversion specifiers in format string
#Attempt 4 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
FROM d2
LEFT JOIN
d on
d2.person_2 = d.person and
d2.visit_2 = d.visit and
(d$visit %m-% months(1) >= d$start) and
(d$visit %m-% months(1) <= d$end)')
#> Error: tinyformat: Too many conversion specifiers in format string
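# Note on Attempts 2 to 4: the query string is handed to the SQL engine, which does not evaluate R expressions such as `d$visit %m-% months(1)`. The tinyformat message is probably triggered by the stray `%` characters, although that is an assumption. A sketch of one way around it is to do the month arithmetic in R first and keep only plain SQL in the query. The tables `d_sql` and `d2_sql` and their column names are hypothetical copies made for this example; `end` is renamed because it is a reserved word in SQL.

```r
library(sqldf)
library(lubridate)

# Hypothetical copies of d and d2 with SQL-safe column names
d_sql <- data.frame(
  person     = c(1, 1, 1, 1, 1, 2, 2, 2),
  row_id     = c(1, 2, 3, 4, 5, 1, 2, 3),
  start_date = as.Date(c("2016-06-17", "2016-10-01", "2017-01-01", "2017-01-15",
                         "2017-06-05", "2014-12-14", "2015-01-01", "2015-01-19")),
  end_date   = as.Date(c("2016-09-30", "2016-12-31", "2017-01-14", "2017-06-04",
                         "2017-09-03", "2014-12-31", "2015-01-18", "2015-07-03"))
)
d2_sql <- data.frame(
  person_2 = c(1, 1, 2),
  visit_2  = as.Date(c("2017-01-15", "2017-08-01", "2015-02-22"))
)

# Month arithmetic happens in R, before the query is built
d2_sql$visit_m1 <- d2_sql$visit_2 %m-% months(1)

# Plain SQL only: join on person and on the date falling inside the period
result <- sqldf("
  SELECT d2_sql.person_2, d2_sql.visit_2, d_sql.row_id,
         d_sql.start_date, d_sql.end_date
  FROM d2_sql
  LEFT JOIN d_sql
    ON d2_sql.person_2 = d_sql.person
   AND d2_sql.visit_m1 BETWEEN d_sql.start_date AND d_sql.end_date
")
result
```

The `BETWEEN` comparison works here because the dates reach SQLite as numbers. This sketch matches only the period containing the single day one month before the visit.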
d
#> person row start end visit
#> 1 1 1 2016-06-17 2016-09-30 <NA>
#> 2 1 2 2016-10-01 2016-12-31 <NA>
#> 3 1 3 2017-01-01 2017-01-14 <NA>
#> 4 1 4 2017-01-15 2017-06-04 2017-01-15
#> 5 1 5 2017-06-05 2017-09-03 2017-08-01
#> 6 2 1 2014-12-14 2014-12-31 <NA>
#> 7 2 2 2015-01-01 2015-01-18 <NA>
#> 8 2 3 2015-01-19 2015-07-03 2015-02-22
Created on 2022-05-13 by the reprex package (v2.0.1)
#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')
#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)
#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')
d2
#> person_2 visit_2
#> 1 1 2017-01-15
#> 2 1 2017-08-01
#> 3 2 2015-02-22
Created on 2022-05-13 by the reprex package (v2.0.1)
d2 %>%
  mutate(visit_2_m1 = visit_2 %m-% months(1)) %>%
  fuzzyjoin::fuzzy_left_join(
    d, ., by = c("start" = "visit_2_m1", "end" = "visit_2_m1"),
    match_fun = list(`<=`, `>=`))
# person row start end visit person_2 visit_2 visit_2_m1
# 1 1 1 2016-06-17 2016-09-30 <NA> NA <NA> <NA>
# 2 1 2 2016-10-01 2016-12-31 <NA> 1 2017-01-15 2016-12-15
# 3 1 3 2017-01-01 2017-01-14 <NA> NA <NA> <NA>
# 4 1 4 2017-01-15 2017-06-04 2017-01-15 NA <NA> <NA>
# 5 1 5 2017-06-05 2017-09-03 2017-08-01 1 2017-08-01 2017-07-01
# 6 2 1 2014-12-14 2014-12-31 <NA> NA <NA> <NA>
# 7 2 2 2015-01-01 2015-01-18 <NA> NA <NA> <NA>
# 8 2 3 2015-01-19 2015-07-03 2015-02-22 2 2015-02-22 2015-01-22
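The join above matches on the dates only, so with other data a period could be paired with a different person's visit. A hedged variant, assuming d2 should be the backbone as the question describes, adds an equality match on person. The data frames are rebuilt here so the block runs on its own, and `matched` is a name introduced for this sketch.

```r
library(dplyr)
library(lubridate)
library(fuzzyjoin)

d <- data.frame(
  person = c(1, 1, 1, 1, 1, 2, 2, 2),
  row    = c(1, 2, 3, 4, 5, 1, 2, 3),
  start  = as.Date(c("2016-06-17", "2016-10-01", "2017-01-01", "2017-01-15",
                     "2017-06-05", "2014-12-14", "2015-01-01", "2015-01-19")),
  end    = as.Date(c("2016-09-30", "2016-12-31", "2017-01-14", "2017-06-04",
                     "2017-09-03", "2014-12-31", "2015-01-18", "2015-07-03")),
  visit  = as.Date(c(NA, NA, NA, "2017-01-15", "2017-08-01",
                     NA, NA, "2015-02-22"))
)
d2 <- data.frame(
  person_2 = c(1, 1, 2),
  visit_2  = as.Date(c("2017-01-15", "2017-08-01", "2015-02-22"))
)

matched <- d2 %>%
  mutate(visit_2_m1 = visit_2 %m-% months(1)) %>%
  fuzzy_left_join(
    d,
    # same person, and start <= visit_2_m1 <= end
    by = c("person_2"   = "person",
           "visit_2_m1" = "start",
           "visit_2_m1" = "end"),
    match_fun = list(`==`, `>=`, `<=`)
  )

matched[, c("person_2", "visit_2", "row", "start", "end")]
```

With the example data each visit matches exactly one period: rows 2 and 5 for person 1, and row 3 for person 2.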
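The description of your problem is unclear, and I am not sure I fully understand what you want to achieve. How many rows should be returned in the example above? Do you only want to join a subset of the rows in d to d2?
Here is a solution that uses dplyr (with lubridate for the date operators) and may be what you are looking for: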
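# %--%, %within% and %m-% come from lubridate
d %>%
  mutate(
    date_interval = start %--% end,
    visit_within = (visit %m-% months(1)) %within% date_interval
  ) %>%
  filter(visit_within) %>%
  # d2$person_2 is text when d2 is built with cbind(), so convert it before joining
  left_join(mutate(d2, person_2 = as.numeric(as.character(person_2))), ., by = c("person_2" = "person"))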
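If "contains the month before the visit" is read as "the period overlaps the window from one month before the visit up to the visit", another option is to join on person first and filter on the dates afterwards. The following is a sketch under that assumption, not a confirmed reading of the question; `window_start` and `result` are names introduced here, and d2 is rebuilt with `cbind()` as in the question to show the type conversion. Recent dplyr versions may warn about a many-to-many relationship in the join, which is expected here.

```r
library(dplyr)
library(lubridate)

d <- data.frame(
  person = c(1, 1, 1, 1, 1, 2, 2, 2),
  row    = c(1, 2, 3, 4, 5, 1, 2, 3),
  start  = as.Date(c("2016-06-17", "2016-10-01", "2017-01-01", "2017-01-15",
                     "2017-06-05", "2014-12-14", "2015-01-01", "2015-01-19")),
  end    = as.Date(c("2016-09-30", "2016-12-31", "2017-01-14", "2017-06-04",
                     "2017-09-03", "2014-12-31", "2015-01-18", "2015-07-03")),
  visit  = as.Date(c(NA, NA, NA, "2017-01-15", "2017-08-01",
                     NA, NA, "2015-02-22"))
)

# d2 as built in the question: cbind() on numbers and strings yields text columns
d2 <- as.data.frame(cbind(person_2 = c(1, 1, 2),
                          visit_2  = c("2017-01-15", "2017-08-01", "2015-02-22")))
d2$visit_2 <- as.Date(d2$visit_2)

result <- d2 %>%
  mutate(
    person_2     = as.numeric(as.character(person_2)),  # match the type of d$person
    window_start = visit_2 %m-% months(1)               # one month before the visit
  ) %>%
  # pair every visit with every period of the same person
  inner_join(select(d, -visit), by = c("person_2" = "person")) %>%
  # keep periods that overlap [window_start, visit_2]
  filter(start <= visit_2, end >= window_start)

result
```

With the example data this keeps rows 2, 3 and 4 for the first visit of person 1, row 5 for the second visit, and row 3 for person 2. To keep only the period containing the single day one month before the visit, the filter could be changed to `start <= window_start, end >= window_start`.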