R - merge dataframes based on recent dates
I have two dataframes:
In DF1, for every ID, the param values have been recorded on various dates. In DF2, for every ID, a number of dates are given. For every ID, I would like to get all the corresponding param and value entries from DF1, depending on the dates: for a given param, either the value that corresponds to the most recent date1 (in DF1) before date2 (in DF2) or, if there is no such date1, the value for the closest date1 after date2.
DF1 is (I have marked with * the correct rows for the result):
ID date1 param value
1 id1 1/1/2020 pA pA_1_1
2 id1 2/1/2020 pA pA_1_2 *
3 id1 17/1/2020 pA pA_1_3
4 id1 20/1/2020 pB pB_1_1 *
5 id1 21/1/2020 pB pB_1_2
6 id2 21/12/2022 pA pA_2_1 *
7 id2 22/12/2022 pA pA_2_2
8 id2 18/12/2022 pB pB_2_1 *
9 id2 19/12/2022 pB pB_2_2
DF2 is:
ID date2
1 id1 15/1/2020
2 id2 20/12/2020
The result should be:
ID date2 param value date1
1 id1 15/1/2020 pA pA_1_2 2/1/2020
2 id1 15/1/2020 pB pB_1_1 20/1/2020
3 id2 20/12/2020 pA pA_2_1 21/12/2022
4 id2 20/12/2020 pB pB_2_1 18/12/2022
Code to reproduce DF1 and DF2:
DF1 <- data.frame(
  stringsAsFactors = FALSE,
  ID = c("id1", "id1", "id1", "id1",
         "id1", "id2", "id2", "id2", "id2"),
  date1 = c("1/1/2020", "2/1/2020",
            "17/1/2020", "20/1/2020", "21/1/2020", "21/12/2022",
            "22/12/2022", "18/12/2022", "19/12/2022"),
  param = c("pA", "pA", "pA", "pB", "pB", "pA", "pA", "pB", "pB"),
  value = c("pA_1_1", "pA_1_2", "pA_1_3",
            "pB_1_1", "pB_1_2", "pA_2_1", "pA_2_2", "pB_2_1", "pB_2_2")
)
DF2 <- data.frame(
  stringsAsFactors = FALSE,
  ID = c("id1", "id2"),
  date2 = c("15/1/2020", "20/12/2020")
)
This is my solution. I'm sure there is a way to write this with less code (using one dataframe instead of building two and merging them later), but I don't know it right now.
library(tidyverse)
library(lubridate)

# Get before date2
before <- DF1 %>%
  left_join(DF2, by = "ID") %>%
  mutate(diff = dmy(date1) - dmy(date2)) %>%
  mutate(Grp = data.table::rleid(param)) %>%
  filter(diff < 0) %>%
  group_by(Grp) %>%
  filter(diff == max(diff)) %>%
  ungroup()

# Get after date2
after <- DF1 %>%
  left_join(DF2, by = "ID") %>%
  mutate(diff = dmy(date1) - dmy(date2)) %>%
  mutate(Grp = data.table::rleid(param)) %>%
  filter(diff > 0) %>%
  group_by(Grp) %>%
  filter(!Grp %in% before$Grp, diff == min(diff)) %>%
  ungroup()

result <- bind_rows(before, after) %>%
  select(ID, date2, param, value, date1) %>%
  arrange(ID, param)
Explanation: I'm using the lubridate library to compare the dates. I run the same process twice to create two dataframes. The first one (before) holds the groups that meet the first condition (closest date in DF1 before date2 in DF2). The second one (after) holds the groups that fall back to the other case (closest date in DF1 after date2 in DF2).
I will explain the first one:
# Get before date2
before <- DF1 %>%
  left_join(DF2, by = "ID") %>%
  mutate(diff = dmy(date1) - dmy(date2)) %>%
  mutate(Grp = data.table::rleid(param)) %>%
  filter(diff < 0) %>%
  group_by(Grp) %>%
  filter(diff == max(diff)) %>%
  ungroup()
Here, we merge DF1 and DF2 by ID, so rows with the same ID have the same date2. Then we calculate the difference date1 - date2, after first converting the character columns to dates with dmy(). Dates before date2 therefore give a negative difference. With data.table::rleid(param) we number the subgroups with different ID and param, so each subgroup gets its own identifier. Note that rleid() numbers consecutive runs, so this works here because the rows are ordered so that each ID and param combination forms a single consecutive run. Then we can group by that identifier and filter within each group.
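To make the two helper calls in the explanation above concrete, here is a small sketch of what they return on toy input. The variable names d1, d2, gap and run_ids are only illustrative and do not appear in the solution.

```r
library(lubridate)

# dmy() parses day/month/year strings into Date objects
d1 <- dmy("2/1/2020")
d2 <- dmy("15/1/2020")

# Subtracting two Dates gives a difftime in days;
# a negative result means d1 falls before d2
gap <- d1 - d2
gap
# Time difference of -13 days

# rleid() numbers consecutive runs of equal values, so a value that
# reappears after a different one starts a new run id
run_ids <- data.table::rleid(c("pA", "pA", "pB", "pB", "pA"))
run_ids
# [1] 1 1 2 2 3
```

The last call shows why the row order matters: the second "pA" run gets id 3, not 1.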
At the end:
result <- bind_rows(before, after) %>%
  select(ID, date2, param, value, date1) %>%
  arrange(ID, param)
We bind the two dataframes by rows and select the columns you are looking for, which drops the helper columns we created for grouping and filtering (diff and Grp). PS: I added arrange() to make sure the final dataframe is sorted by ID and param.
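As a possible take on the "less code" idea mentioned at the start of this answer, the two passes could be folded into one pipeline. This is only a sketch under assumptions: the name result2 and the sorting key are my own choices and not part of the solution above, and it groups explicitly by ID and param instead of using rleid(), so it does not depend on row order.

```r
library(dplyr)
library(lubridate)

# Sample data copied from the question
DF1 <- data.frame(
  stringsAsFactors = FALSE,
  ID = c("id1", "id1", "id1", "id1", "id1", "id2", "id2", "id2", "id2"),
  date1 = c("1/1/2020", "2/1/2020", "17/1/2020", "20/1/2020", "21/1/2020",
            "21/12/2022", "22/12/2022", "18/12/2022", "19/12/2022"),
  param = c("pA", "pA", "pA", "pB", "pB", "pA", "pA", "pB", "pB"),
  value = c("pA_1_1", "pA_1_2", "pA_1_3", "pB_1_1", "pB_1_2",
            "pA_2_1", "pA_2_2", "pB_2_1", "pB_2_2")
)
DF2 <- data.frame(
  stringsAsFactors = FALSE,
  ID = c("id1", "id2"),
  date2 = c("15/1/2020", "20/12/2020")
)

result2 <- DF1 %>%
  left_join(DF2, by = "ID") %>%
  # Signed gap in days: negative means date1 is before date2
  mutate(diff = as.numeric(dmy(date1) - dmy(date2))) %>%
  # Assumption: drop exact matches to mirror the strict < and > used above
  filter(diff != 0) %>%
  group_by(ID, param) %>%
  # Within each group, rows before date2 sort first (FALSE < TRUE),
  # then the smallest absolute gap
  arrange(diff > 0, abs(diff), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup() %>%
  select(ID, date2, param, value, date1)

result2
```

Sorting on diff > 0 first encodes the preference for a date before date2, and abs(diff) then picks the closest date on whichever side is available.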