
R: merge dataframes based on recent dates

I have two dataframes:

In DF1, for every ID, param has been recorded on various dates. In DF2, for every ID, a number of dates are given. For every ID, I would like to get all the corresponding param and value from DF1, depending on the dates: for a given param, either the value that corresponds to the most recent date1 (in DF1) before date2 (in DF2) or, if there is no such date1, the value for the closest date1 after date2.

DF1 is (I have marked the correct rows for the result with *):

  ID      date1 param  value
1 id1   1/1/2020    pA pA_1_1
2 id1   2/1/2020    pA pA_1_2 *
3 id1  17/1/2020    pA pA_1_3
4 id1  20/1/2020    pB pB_1_1 *
5 id1  21/1/2020    pB pB_1_2
6 id2 21/12/2022    pA pA_2_1 *
7 id2 22/12/2022    pA pA_2_2 
8 id2 18/12/2022    pB pB_2_1 *
9 id2 19/12/2022    pB pB_2_2 

DF2 is:

   ID      date2
1 id1  15/1/2020
2 id2 20/12/2020

The result should be:

   ID      date2 param  value      date1
1 id1  15/1/2020    pA pA_1_2   2/1/2020
2 id1  15/1/2020    pB pB_1_1  20/1/2020
3 id2 20/12/2020    pA pA_2_1 21/12/2022
4 id2 20/12/2020    pB pB_2_1 18/12/2022

Code to reproduce DF1 and DF2:

DF1= data.frame(
  stringsAsFactors = FALSE,
                ID = c("id1","id1","id1","id1",
                       "id1","id2","id2","id2","id2"),
             date1 = c("1/1/2020","2/1/2020",
                       "17/1/2020","20/1/2020","21/1/2020","21/12/2022",
                       "22/12/2022","18/12/2022","19/12/2022"),
             param = c("pA", "pA", "pA", "pB", "pB", "pA", "pA", "pB", "pB"),
             value = c("pA_1_1","pA_1_2","pA_1_3",
                       "pB_1_1","pB_1_2","pA_2_1","pA_2_2","pB_2_1","pB_2_2")
)

DF2=data.frame(
  stringsAsFactors = FALSE,
                ID = c("id1", "id2"),
             date2 = c("15/1/2020", "20/12/2020")
)

This is my solution. I'm sure there is a way to write this with less code (using one dataframe instead of building two and merging them later), but I don't know it right now.

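library(tidyverse)
library(lubridate)
# Note: rleid() comes from the data.table package, which must be installed.

# Get before date2: per group, the date1 closest to date2 among those before it
before <- DF1 %>%
  left_join(DF2, by = "ID") %>%
  mutate(diff = dmy(date1) - dmy(date2)) %>%
  # Number consecutive runs of param; this assumes each ID/param combination
  # forms its own contiguous run, as in the sample data
  mutate(Grp = data.table::rleid(param)) %>%
  filter(diff < 0) %>%
  group_by(Grp) %>%
  filter(diff == max(diff)) %>%
  ungroup()

# Get after date2: only for groups that have no date1 before date2
after <- DF1 %>%
  left_join(DF2, by = "ID") %>%
  mutate(diff = dmy(date1) - dmy(date2)) %>%
  mutate(Grp = data.table::rleid(param)) %>%
  filter(diff > 0) %>%
  group_by(Grp) %>%
  filter(!Grp %in% before$Grp, diff == min(diff)) %>%
  ungroup()

# Combine both cases and keep only the requested columns
result <- bind_rows(before, after) %>%
  select(ID, date2, param, value, date1) %>%
  arrange(ID, param)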

Explanation: I'm using the lubridate library to compare the dates. I run the same process twice to create two dataframes: the first one (before) holds the groups that satisfy the first condition (closest date in DF1 before date2 in DF2), and the second one (after) holds the groups that only satisfy the other one (closest date in DF1 after date2 in DF2).

I will explain the first one:

# Get before date2
before <- DF1 %>%
  left_join(DF2, by = "ID") %>%
  mutate(diff = dmy(date1) - dmy(date2)) %>%
  mutate(Grp = data.table::rleid(param)) %>%
  filter(diff < 0) %>%
  group_by(Grp) %>%
  filter(diff == max(diff)) %>%
  ungroup()

Here, we merge DF1 and DF2 by ID, so rows with the same ID have the same date2. Then we calculate the difference date1 - date2, first converting the character strings to dates with dmy(). Dates before date2 therefore give a negative difference. With data.table::rleid(param) we number each consecutive run of param; because DF1 lists each ID and param combination as one contiguous block, these run numbers identify the ID and param subgroups. Then we can group by them and filter within each group.
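To make the two building blocks above concrete, here is a small sketch of what the date difference and rleid() return on values taken from the question. The variable names d, param and grp are only for illustration and are not part of the answer's code.

```r
library(lubridate)

# A date1 before date2 gives a negative difference, one after gives a positive one.
d <- dmy(c("2/1/2020", "17/1/2020")) - dmy("15/1/2020")
as.numeric(d)
# -13   2

# rleid() numbers consecutive runs of equal values; it never looks at ID,
# so it only separates ID/param subgroups when each one is a contiguous run.
param <- c("pA", "pA", "pA", "pB", "pB", "pA", "pA", "pB", "pB")
grp <- data.table::rleid(param)
grp
# 1 1 1 2 2 3 3 4 4
```

If the rows of DF1 were shuffled, or two neighbouring IDs each had only the same single param, the run numbers would no longer match the subgroups, so grouping by ID and param directly would be the safer choice in that case.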

At the end:

result <- bind_rows(before,after) %>% 
  select(ID,date2, param, value, date1) %>%
  arrange(ID, param)

We bind the two dataframes by rows and select the columns you are looking for, which drops the helper columns we created along the way (diff and Grp). PS: I added arrange() to make sure the final dataframe is sorted by ID and param.
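The answer suspects that a shorter version exists. One possible single-pipeline sketch is shown below; it is not the answer's method. It groups by ID, date2 and param directly instead of using rleid(), and sorts each group so that dates before date2 come first. The name result2 is hypothetical, and one assumption differs from the code above: a date1 equal to date2 is treated here as "not before" rather than being dropped.

```r
library(dplyr)
library(lubridate)

# Same data as in the question, repeated so the sketch runs on its own.
DF1 <- data.frame(
  ID    = rep(c("id1", "id2"), times = c(5, 4)),
  date1 = c("1/1/2020", "2/1/2020", "17/1/2020", "20/1/2020", "21/1/2020",
            "21/12/2022", "22/12/2022", "18/12/2022", "19/12/2022"),
  param = c("pA", "pA", "pA", "pB", "pB", "pA", "pA", "pB", "pB"),
  value = c("pA_1_1", "pA_1_2", "pA_1_3", "pB_1_1", "pB_1_2",
            "pA_2_1", "pA_2_2", "pB_2_1", "pB_2_2"),
  stringsAsFactors = FALSE
)
DF2 <- data.frame(
  ID    = c("id1", "id2"),
  date2 = c("15/1/2020", "20/12/2020"),
  stringsAsFactors = FALSE
)

result2 <- DF1 %>%
  inner_join(DF2, by = "ID") %>%
  # Signed distance in days: negative means date1 is before date2.
  mutate(diff = as.numeric(dmy(date1) - dmy(date2))) %>%
  group_by(ID, date2, param) %>%
  # Within each group: rows before date2 first (FALSE sorts before TRUE),
  # and within each side the row closest to date2 first.
  arrange(diff >= 0, abs(diff), .by_group = TRUE) %>%
  slice(1) %>%
  ungroup() %>%
  select(ID, date2, param, value, date1)

result2
```

On the sample data this selects the same four rows that are marked with * in the question. Grouping by date2 as well means the sketch should also cope with several date2 values per ID, although recent dplyr versions may warn about a many-to-many join in that situation.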
