Match dates and id between data.frames only once

I have two example data frames, as follows:

# Cases: id and date, plus a key w that combines the two
id <- c(1, 2, 3, 1, 4, 3, 5)
date <- c("2011-1-1", "2011-1-1", "2011-2-2", "2012-3-3", "2012-4-4", "2012-5-5", "2012-6-6")
d <- data.frame(id, date, stringsAsFactors = FALSE)
d$w <- paste(d$id, d$date)

# Controls: same layout as d
id <- c(7, 8, 9, 10, 7, 10, 8, 10, 11, 12)
date <- c("2011-1-1", "2011-1-1", "2011-2-2", "2012-3-3", "2012-3-3", "2012-4-4", "2012-4-4", "2012-5-5", "2012-6-6", "2012-6-6")
contr <- data.frame(id, date, stringsAsFactors = FALSE)
contr$w <- paste(contr$id, contr$date)

Note that ids and dates are repeated within both datasets, but the values in d$id are all different from those in contr$id, and every contr$date is %in% d$date. What I want is y, a vector containing ONE contr$w for each d$id that has a matching date, i.e. a contr$date that is %in% d$date.

I have tried the following, which does not work, but I am sure there must be a much easier and simpler (and therefore better) way to do it.

y<-0
for(i in length(levels(factor(d$w)))){
   for(j in length(levels(factor(contr$w)))){
     z<-ifelse(d$date[i]==contr$date[j],contr$w[j],NA)
     y<-c(y,z)
     y<-subset(y,!is.na(y))
  }
}

Can anyone help? Many thanks, Marco
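One possible reading of the request is: for each row of d, draw one row of contr with the same date, and never hand out the same contr row twice. The sketch below follows that reading; it is an assumption, not something the post confirms. `match_once()` is a hypothetical helper name, and the data are the example values from the question. Note also that `for(i in length(x))` visits only the single value `length(x)`, so the attempt above runs its body once; `seq_len()` is used here instead.

```r
# Example data taken from the question
d <- data.frame(
  id   = c(1, 2, 3, 1, 4, 3, 5),
  date = c("2011-1-1", "2011-1-1", "2011-2-2", "2012-3-3",
           "2012-4-4", "2012-5-5", "2012-6-6"),
  stringsAsFactors = FALSE
)
contr <- data.frame(
  id   = c(7, 8, 9, 10, 7, 10, 8, 10, 11, 12),
  date = c("2011-1-1", "2011-1-1", "2011-2-2", "2012-3-3", "2012-3-3",
           "2012-4-4", "2012-4-4", "2012-5-5", "2012-6-6", "2012-6-6"),
  stringsAsFactors = FALSE
)
contr$w <- paste(contr$id, contr$date)

# match_once() is a hypothetical helper, not part of the original post.
# Assumption: one control is wanted per ROW of d, matched on date only.
match_once <- function(d, contr) {
  used <- rep(FALSE, nrow(contr))    # controls already handed out
  y <- rep(NA_character_, nrow(d))   # one result slot per row of d
  for (i in seq_len(nrow(d))) {
    # Unused controls sharing this case's date
    pool <- which(contr$date == d$date[i] & !used)
    if (length(pool) > 0) {
      # Index via sample.int so a pool of length one is handled correctly
      pick <- pool[sample.int(length(pool), 1)]
      y[i] <- contr$w[pick]
      used[pick] <- TRUE
    }
  }
  y
}

set.seed(1)
y <- match_once(d, contr)
```

Rows of d with no unused control left on their date are returned as NA rather than dropped, so y stays aligned with d.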

This did what I wanted; maybe I was not clear enough in my explanation. I just wanted a random date per id (then I can create the w column). I have sorted this out by using a solution from this other question:

Random row selection in R

Many thanks for the effort anyway! Marco
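The code from the linked question is not reproduced here. As a rough illustration of "a random date per id", the base R sketch below draws one random row for each id and then builds the w column. `one_per_id()` is a hypothetical helper name, and applying it to contr is an assumption about which data frame was meant.

```r
# Control data taken from the question
contr <- data.frame(
  id   = c(7, 8, 9, 10, 7, 10, 8, 10, 11, 12),
  date = c("2011-1-1", "2011-1-1", "2011-2-2", "2012-3-3", "2012-3-3",
           "2012-4-4", "2012-4-4", "2012-5-5", "2012-6-6", "2012-6-6"),
  stringsAsFactors = FALSE
)

# one_per_id() is a hypothetical helper, not part of the original post.
one_per_id <- function(df) {
  # Row indices grouped by id
  rows <- split(seq_len(nrow(df)), df$id)
  # Draw one index per group; sample.int avoids the sample(x, 1) pitfall
  # when a group holds a single row
  keep <- vapply(rows, function(r) r[sample.int(length(r), 1)], integer(1))
  df[sort(keep), ]
}

set.seed(42)
picked <- one_per_id(contr)
picked$w <- paste(picked$id, picked$date)  # build the key afterwards
```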

Actually, I have now written a loop that does this (the previous answer did not work because some cases in d did not have a matching date in contr). It is very slow, but it does exactly what I wanted:

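# d and contr each need the columns rownames (a unique row id), w, practice and date.
# control.2 is not defined in the post; it is initialised here as an assumption.
control.2 <- data.frame(rownames = rep(NA, nrow(d)))
for (i in seq_len(nrow(d))) {
  if (any(contr$w == d$w[i])) {
    # Exact match on w: every control with the same key is a candidate
    pool <- contr$rownames[contr$w == d$w[i]]
  } else {
    # No exact match: same practice, closest date
    z <- contr[contr$practice == d$practice[i], ]
    z$tempo <- abs(difftime(z$date, d$date[i], units = "days"))
    z <- z[!is.na(z$tempo), ]
    z <- z[z$tempo == min(z$tempo), ]
    pool <- z$rownames
  }
  # Index via sample.int: sample(x, 1) draws from 1:x when x is a single number
  control.2$rownames[i] <- pool[sample.int(length(pool), 1)]
  # Remove the chosen control so it cannot be matched again
  contr <- contr[!contr$rownames %in% control.2$rownames[i], ]
}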

Not the best code, I am sure, but it works. The second branch (the else block) handles the few cases where there was no control with a matching date; for those I sample() one of the controls with the closest date. If you can come up with a faster version, that would be nice. My datasets have about 5K rows in d and about 2.5 million rows in contr, and the loop takes roughly 2 hours to run. Painful but worth the wait!
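One way the loop might be sped up, offered as a sketch rather than a tested replacement: group the row indices of contr once with split(), and mark chosen controls in a logical vector instead of re-subsetting the 2.5 million row data frame on every iteration. `match_controls()` is a hypothetical name, the toy values are made up, and the column names (practice, date, w) are taken from the loop above. It assumes date is of class Date and that practice and w are character, not factor. It has not been timed on data of the size described.

```r
# Toy data; column names follow the loop above, the values are invented.
d <- data.frame(
  practice = c("A", "A", "B"),
  date     = as.Date(c("2011-01-01", "2011-03-01", "2012-06-06")),
  stringsAsFactors = FALSE
)
contr <- data.frame(
  practice = c("A", "A", "A", "B", "B"),
  date     = as.Date(c("2011-01-01", "2011-01-01", "2011-02-20",
                       "2012-06-01", "2012-07-30")),
  stringsAsFactors = FALSE
)
d$w     <- paste(d$practice, d$date)
contr$w <- paste(contr$practice, contr$date)

# match_controls() is a hypothetical helper, not part of the original post.
# It returns, for each row of d, the row index of the chosen control (or NA).
match_controls <- function(d, contr) {
  by_w        <- split(seq_len(nrow(contr)), contr$w)         # exact-key lookup
  by_practice <- split(seq_len(nrow(contr)), contr$practice)  # fallback lookup
  used   <- rep(FALSE, nrow(contr))
  chosen <- rep(NA_integer_, nrow(d))
  for (i in seq_len(nrow(d))) {
    pool <- by_w[[d$w[i]]]             # NULL when the key does not occur
    pool <- pool[!used[pool]]
    if (length(pool) == 0) {
      # Fallback: unused controls from the same practice, closest date
      cand <- by_practice[[d$practice[i]]]
      cand <- cand[!used[cand]]
      gap  <- abs(as.numeric(contr$date[cand] - d$date[i]))
      ok   <- !is.na(gap)
      cand <- cand[ok]
      gap  <- gap[ok]
      pool <- if (length(gap) > 0) cand[gap == min(gap)] else integer(0)
    }
    if (length(pool) > 0) {
      chosen[i] <- pool[sample.int(length(pool), 1)]
      used[chosen[i]] <- TRUE          # never reuse a control
    }
  }
  chosen
}

set.seed(1)
chosen <- match_controls(d, contr)
matched <- contr[chosen, ]
```

Unlike the original loop, a case with no remaining candidate gets NA instead of raising an error, which may or may not be the behaviour wanted.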
