Finding closest matching number between 2 dataframes using a grouped dplyr identifier

I have 2 datasets, each with a `Patient ID` and a collection date measured from the same reference date ("from start"). In order to join these dataframes together, I'd like to match each sample in d1 to its closest neighbor in d2. How can this be done with a function in dplyr?

d1 <- data.frame(`Patient ID` = c(rep("001", 4), rep("002", 5)),
                 fromstart = c(-5, 30, 90, 150, -10, 15, 45, 100, 250),
                 check.names = FALSE)
d2 <- data.frame(`Patient ID` = c(rep("001", 7), rep("002", 4)),
                 fromstart = c(-20, 10, 30, 50, 90, 110, 150, -10, 15, 45, 100),
                 check.names = FALSE)

closest_date <- function(cases, d2) {
  return(
    d2 %>%
      select(`Patient ID`, fromstart) %>%
      unique() %>%
      filter(`Patient ID` == cases$`Patient ID`) %>%
      rowwise() %>%
      mutate(date_match = as.numeric(cases$fromstart[which.min(abs(fromstart - cases$fromstart))]))
  )
}

d1 %>%
  select(`Patient ID`, fromstart) %>%
  unique() %>%
  group_by(`Patient ID`) %>%
  rowwise() %>%
  mutate(closest = closest_date(., d2))

If I understood your problem correctly, you want to join by patient ID and then keep the rows where the difference between the two fromstart values is smallest? If so, this would be a solution:

library(dplyr)
d1 %>% 
  dplyr::full_join(d2, by = c("Patient ID"), suffix = c("_1", "_2")) %>% 
  dplyr::mutate(DIF = abs(fromstart_1  - fromstart_2)) %>% 
  dplyr::group_by(`Patient ID`, fromstart_1) %>% 
  dplyr::filter(DIF == min(DIF))

As you can see, this does not work well if you want unique combinations, because there can be cases where two candidates are at the same distance. Then again, maybe I did not get your question right.
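If exactly one match per d1 sample is wanted, one possible variation is sketched below. It swaps the `filter(DIF == min(DIF))` step for `slice_min(..., with_ties = FALSE)`, which keeps a single row per group even when distances tie. This is an assumption about what the asker needs, not part of the answer above; the name `closest` is made up for illustration. In the sample data, the -5 sample of patient 001 is 15 away from both -20 and 10, so the original filter returns two rows for it. Which of the tied rows survives here is arbitrary, so add an explicit ordering if that matters.

```r
library(dplyr)

# Sample data from the question
d1 <- data.frame(`Patient ID` = c(rep("001", 4), rep("002", 5)),
                 fromstart = c(-5, 30, 90, 150, -10, 15, 45, 100, 250),
                 check.names = FALSE)
d2 <- data.frame(`Patient ID` = c(rep("001", 7), rep("002", 4)),
                 fromstart = c(-20, 10, 30, 50, 90, 110, 150, -10, 15, 45, 100),
                 check.names = FALSE)

closest <- d1 %>%
  # Pair every d1 sample with every d2 sample of the same patient.
  # Recent dplyr versions may warn about a many-to-many relationship here.
  full_join(d2, by = "Patient ID", suffix = c("_1", "_2")) %>%
  mutate(DIF = abs(fromstart_1 - fromstart_2)) %>%
  group_by(`Patient ID`, fromstart_1) %>%
  # Keep one row per d1 sample, even when two d2 samples are equally close
  slice_min(DIF, n = 1, with_ties = FALSE) %>%
  ungroup()

print(closest)
```

With this change the result has one row per row of d1, at the cost of silently discarding the other tied candidate.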

Instead of using the absolute value, you could also filter for positive differences if you want fromstart in the second table to be larger than in the first. This would reduce duplicate matches to a certain degree.
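A minimal sketch of that signed-difference idea, under a couple of assumptions: the difference is taken as `fromstart_2 - fromstart_1`, and exact matches (a difference of 0) are kept, so the filter is `>= 0` rather than strictly positive. The name `later_only` is hypothetical.

```r
library(dplyr)

# Sample data from the question
d1 <- data.frame(`Patient ID` = c(rep("001", 4), rep("002", 5)),
                 fromstart = c(-5, 30, 90, 150, -10, 15, 45, 100, 250),
                 check.names = FALSE)
d2 <- data.frame(`Patient ID` = c(rep("001", 7), rep("002", 4)),
                 fromstart = c(-20, 10, 30, 50, 90, 110, 150, -10, 15, 45, 100),
                 check.names = FALSE)

later_only <- d1 %>%
  full_join(d2, by = "Patient ID", suffix = c("_1", "_2")) %>%
  # Signed difference: non-negative when the d2 sample is on or after the d1 sample
  mutate(DIF = fromstart_2 - fromstart_1) %>%
  filter(DIF >= 0) %>%
  group_by(`Patient ID`, fromstart_1) %>%
  # Closest d2 sample that is not earlier than the d1 sample
  filter(DIF == min(DIF)) %>%
  ungroup()

print(later_only)
```

One side effect worth knowing: a d1 sample with no d2 sample on or after it drops out of the result entirely. In the sample data this happens to the 250 sample of patient 002, whose latest d2 sample is at 100.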
