How to join two data frames by rows based on column value of first data frame in R?
How to left join two data frames conditionally - by rows that fall within a date range - and by two variables found in each data frame
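I have two mock data frames:
Question: I want to merge only those rows of d that contain the time one month before each visit date for each person to the rows in d2. For the data to merge correctly, I need to merge by both person and visit date. In the end, I want a data set that contains only the rows whose start and end dates contain the date one month before each visit date, and I want to keep all columns. I need a solution that does not use data.table.
If it works: from d it would pull,
Then it would merge each set of rows to the rows of d2 with the same person and visit date (a "one-to-many" merge).
The tricky part: if row 4 for person 1 overlapped with the 1 month before the visit date contained in row 5 for person 1, I would want to pull row 4 and attach it to the second visit date (but not pull row 4 and attach it to the first visit date, b/c it does not contain the month before the first visit date).
Below are the 4 different approaches I tried, each annotated with the error message it produced.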
#Load packages
pacman::p_load(dplyr, tidyr, lubridate, sqldf)
#Create data frame 1
#Create variables for data frame 1
person <- c(1, 1, 1, 1, 1, 2, 2, 2)
start <- c('2016-06-17', '2016-10-01', '2017-01-01', '2017-01-15', '2017-06-05', '2014-12-14', '2015-01-01', '2015-01-19')
end <- c('2016-09-30', '2016-12-31', '2017-01-14', '2017-06-04', '2017-09-03', '2014-12-31', '2015-01-18', '2015-07-03')
visit <- c(NA, NA, NA, '2017-01-15', '2017-08-01', NA, NA, '2015-02-22')
row <- c(1, 2, 3, 4, 5, 1, 2, 3)
#Populate data frame 1 with variables
d <- cbind(person, row)
d <- as.data.frame(d)
#Format dates and add to data frame 1
d$start <- as.Date(start, format = '%Y-%m-%d')
d$end <- as.Date(end, format = '%Y-%m-%d')
d$visit <- as.Date(visit, format = '%Y-%m-%d')
######
#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')
#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)
#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')
#Need to merge conditionally such that only rows from d that contain time one month before 'visit' are selected, and then merged by both 'person' and 'visit' to d2 (the 'backbone' data set)
#Attempt 1 yields this error message:
#Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= :
# the condition has length > 1
if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$end)) {
left_join(x = d, y = d2, by = c('person' = 'person_2', 'visit' = 'visit_2'))
}
#> Error in if (((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= : the condition has length > 1
#Attempt 2 - Error: tinyformat: Too many conversion specifiers in format string
result = sqldf('
select *
from back left join d on
(((d$visit %m-% months(1)) >= d$start) & ((d$visit %m-% months(1)) <= d$EndDate))
AND d2.person_2 = d.person
And d2.visit_2 = d.visit
')
#> Error: tinyformat: Too many conversion specifiers in format string
#Attempt 3 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
FROM d2
LEFT JOIN d ON (visit %m-% months(1)) BETWEEN start and end')
#> Error: tinyformat: Too many conversion specifiers in format string
#Attempt 4 - Error: tinyformat: Too many conversion specifiers in format string
sqldf('SELECT *
FROM d2
LEFT JOIN
d on
d2.person_2 = d.person and
d2.visit_2 = d.visit and
(d$visit %m-% months(1) >= d$start) and
(d$visit %m-% months(1) <= d$end)')
#> Error: tinyformat: Too many conversion specifiers in format string
d
#>   person row      start        end      visit
#> 1      1   1 2016-06-17 2016-09-30       <NA>
#> 2      1   2 2016-10-01 2016-12-31       <NA>
#> 3      1   3 2017-01-01 2017-01-14       <NA>
#> 4      1   4 2017-01-15 2017-06-04 2017-01-15
#> 5      1   5 2017-06-05 2017-09-03 2017-08-01
#> 6      2   1 2014-12-14 2014-12-31       <NA>
#> 7      2   2 2015-01-01 2015-01-18       <NA>
#> 8      2   3 2015-01-19 2015-07-03 2015-02-22
Created on 2022-05-13 by the reprex package (v2.0.1)
#Create data frame 2
person_2 <- c(1, 1, 2)
visit_2 <- c('2017-01-15', '2017-08-01', '2015-02-22')
#Populate data frame 2
d2 <- cbind(person_2, visit_2)
d2 <- as.data.frame(d2)
#Format dates and add to data frame 2
d2$visit_2 <- as.Date(visit_2, format = '%Y-%m-%d')
d2
#>   person_2    visit_2
#> 1        1 2017-01-15
#> 2        1 2017-08-01
#> 3        2 2015-02-22
Created on 2022-05-13 by the reprex package (v2.0.1)
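A note on Attempt 1: `if` expects a single TRUE or FALSE, whereas comparing whole columns yields one logical value per row, which is what "the condition has length > 1" is complaining about. The sketch below shows the vectorized alternative in base R. It uses a three-row excerpt of d and a `target` date chosen purely for illustration (both are assumptions, not part of the original attempts).

```r
# Three-row excerpt of d (person 1, rows 3 to 5), re-typed for illustration
d_small <- data.frame(
  row   = c(3, 4, 5),
  start = as.Date(c("2017-01-01", "2017-01-15", "2017-06-05")),
  end   = as.Date(c("2017-01-14", "2017-06-04", "2017-09-03"))
)

# Hypothetical date to test against each interval
target <- as.Date("2017-07-01")

# The comparison is evaluated row by row and returns a logical vector,
# which can be used for subsetting but not as an `if` condition
in_range <- target >= d_small$start & target <= d_small$end

# Keep only the rows whose [start, end] interval contains the target date
hits <- d_small[in_range, ]
```

Used this way, the condition selects rows instead of deciding once whether to run a join at all.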
d2 %>%
  mutate(visit_2_m1 = visit_2 %m-% months(1)) %>%
  fuzzyjoin::fuzzy_left_join(
    d, ., by = c("start" = "visit_2_m1", "end" = "visit_2_m1"),
    match_fun = list(`<=`, `>=`))
#   person row      start        end      visit person_2    visit_2 visit_2_m1
# 1      1   1 2016-06-17 2016-09-30       <NA>       NA       <NA>       <NA>
# 2      1   2 2016-10-01 2016-12-31       <NA>        1 2017-01-15 2016-12-15
# 3      1   3 2017-01-01 2017-01-14       <NA>       NA       <NA>       <NA>
# 4      1   4 2017-01-15 2017-06-04 2017-01-15       NA       <NA>       <NA>
# 5      1   5 2017-06-05 2017-09-03 2017-08-01        1 2017-08-01 2017-07-01
# 6      2   1 2014-12-14 2014-12-31       <NA>       NA       <NA>       <NA>
# 7      2   2 2015-01-01 2015-01-18       <NA>       NA       <NA>       <NA>
# 8      2   3 2015-01-19 2015-07-03 2015-02-22        2 2015-02-22 2015-01-22
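The fuzzy join above matches on the date range only, while the question also asks for a match on person. As a rough sketch of one way to add that condition with base R alone, the code below merges on person first and then filters on the date range. The helper `minus_one_month()` is a hypothetical name introduced here, and `person_2` is built as numeric so that it is comparable with `person`; both are assumptions rather than part of the original answer.

```r
# Data re-created from the question (person_2 kept numeric here)
d <- data.frame(
  person = c(1, 1, 1, 1, 1, 2, 2, 2),
  row    = c(1, 2, 3, 4, 5, 1, 2, 3),
  start  = as.Date(c("2016-06-17", "2016-10-01", "2017-01-01", "2017-01-15",
                     "2017-06-05", "2014-12-14", "2015-01-01", "2015-01-19")),
  end    = as.Date(c("2016-09-30", "2016-12-31", "2017-01-14", "2017-06-04",
                     "2017-09-03", "2014-12-31", "2015-01-18", "2015-07-03")),
  visit  = as.Date(c(NA, NA, NA, "2017-01-15", "2017-08-01",
                     NA, NA, "2015-02-22"))
)
d2 <- data.frame(
  person_2 = c(1, 1, 2),
  visit_2  = as.Date(c("2017-01-15", "2017-08-01", "2015-02-22"))
)

# Hypothetical helper: step each date back by one calendar month.
# Note: for month-end dates seq() can roll over differently from
# lubridate's %m-%; the example dates are not affected.
minus_one_month <- function(x) {
  days <- vapply(
    seq_along(x),
    function(i) as.numeric(seq(x[i], by = "-1 month", length.out = 2)[2]),
    numeric(1)
  )
  as.Date(days, origin = "1970-01-01")
}

d2$visit_m1 <- minus_one_month(d2$visit_2)

# Step 1: pair every visit with every interval of the same person
cand <- merge(d2, d, by.x = "person_2", by.y = "person")

# Step 2: keep pairs whose interval contains the date one month before the visit
res <- cand[cand$visit_m1 >= cand$start & cand$visit_m1 <= cand$end, ]
res <- res[order(res$person_2, res$visit_2), ]
```

This keeps only visits that found a matching interval. To keep unmatched visits as well, `res` could be merged back onto `d2` with `all.x = TRUE`.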
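The description of your problem is unclear, and I am not sure I fully understand what you are trying to achieve. How many rows should be returned in the example above? Do you only want to join a subset of the rows in d to d2?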
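Here is a solution that uses only dplyr for the join (the date operators %--%, %within% and %m-% come from lubridate), which may be what you are looking for: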
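d %>%
  mutate(
    # interval spanned by each row of d
    date_interval = start %--% end,
    # TRUE when the date one month before the row's own visit falls inside it
    visit_within = visit %m-% months(1) %within% date_interval
  ) %>%
  # rows with a missing visit yield NA and are dropped here
  filter(visit_within == TRUE) %>%
  left_join(d2, ., by = c("person_2" = "person"))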
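Two caveats about the code above. It tests each row of d against that row's own visit date rather than against every visit in d2, and `person_2` is not numeric when d2 is built with `cbind()` on a number and a string, so the join on `person` can stop with an incompatible-types error. A possible variant is sketched below, assuming dplyr 1.1.0 or later (which postdates this thread) for `join_by()` and its `between()` helper. The column name `visit_2_m1` is borrowed from the other answer.

```r
library(dplyr)      # assumes dplyr >= 1.1.0 for join_by()
library(lubridate)  # for %m-%

# Data re-created from the question
d <- data.frame(
  person = c(1, 1, 1, 1, 1, 2, 2, 2),
  row    = c(1, 2, 3, 4, 5, 1, 2, 3),
  start  = as.Date(c("2016-06-17", "2016-10-01", "2017-01-01", "2017-01-15",
                     "2017-06-05", "2014-12-14", "2015-01-01", "2015-01-19")),
  end    = as.Date(c("2016-09-30", "2016-12-31", "2017-01-14", "2017-06-04",
                     "2017-09-03", "2014-12-31", "2015-01-18", "2015-07-03"))
)
# person_2 is character here, mimicking what cbind() produces in the question
d2 <- data.frame(
  person_2 = c("1", "1", "2"),
  visit_2  = as.Date(c("2017-01-15", "2017-08-01", "2015-02-22"))
)

res <- d2 %>%
  mutate(
    person_2   = as.numeric(person_2),        # make the key comparable with d$person
    visit_2_m1 = visit_2 %m-% months(1)       # date one month before each visit
  ) %>%
  # match on person AND require start <= visit_2_m1 <= end
  left_join(d, by = join_by(person_2 == person,
                            between(visit_2_m1, start, end)))
```

Because d2 is the left table, every visit is kept, and a visit without a matching interval would simply get NA in the columns coming from d.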