
Merge two data sets based on the information from two columns

I have two large data sets like these:

df1 <- data.frame(subject = c(rep(1, 15), rep(2, 14)), day =c(0,0,1,1,1,2,3,15,15,16,16,17,17,18,19,0,0,1,1,2,3,15,15,16,16,17,17,18,19),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/17/2012 7:22','4/17/2012 7:45','4/17/2012 8:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/2/2012 6:28','5/2/2012 7:00','5/3/2012 6:22','5/3/2012 7:00','5/4/2012 6:26','5/5/2012 6:47','4/23/2012 5:56','4/23/2012 6:30','4/24/2012 6:55','4/24/2012 7:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/9/2012 5:55','5/9/2012 6:30','5/10/2012 5:55','5/10/2012 6:30','5/11/2012 6:41','5/12/2012 6:46'))

df2 <- data.frame(subject = c(rep(1, 10), rep(2, 10)), day =c(1,1,2,3,9,12,15,15,16,17,1,1,2,3,9,13,15,15,16,17),dtime=c('4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/25/2012 7:15','4/28/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','5/2/2012 7:00','5/6/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45'))

...

I want to merge the two data sets so that each row of df1 picks up the 'dtime' from df2 that has the same 'subject' and 'day'. Rows of df1 with no match in df2 should get '.' as their 'dtime', and the result should have the same number of rows as df1. The expected output should look like this:

merged <- data.frame(subject = c(rep(1, 15), rep(2, 14)), day =c(0,0,1,1,1,2,3,15,15,16,16,17,17,18,19,0,0,1,1,2,3,15,15,16,16,17,17,18,19),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/17/2012 7:22','4/17/2012 7:45','4/17/2012 8:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/2/2012 6:28','5/2/2012 7:00','5/3/2012 6:22','5/3/2012 7:00','5/4/2012 6:26','5/5/2012 6:47','4/23/2012 5:56','4/23/2012 6:30','4/24/2012 6:55','4/24/2012 7:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/9/2012 5:55','5/9/2012 6:30','5/10/2012 5:55','5/10/2012 6:30','5/11/2012 6:41','5/12/2012 6:46'),dtime =c('.','.','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','.','.','.','.','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','.','.'))

...

I tried merge(df1, df2, by = c('subject', 'day')), but it does not work well: it produces extra rows that I do not want.

Does anyone know how to do this?

This seems to work.

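# left join: keep every row of df1, matched against the de-duplicated df2
result <- merge(df1, unique(df2), by = c("subject", "day"), all.x = TRUE)
# convert to character (only changes anything if dtime is a factor)
result$dtime <- as.character(result$dtime)
# mark the rows of df1 that had no match in df2
result$dtime[is.na(result$dtime)] <- "."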

Some notes:

  1. You don't need the by=... argument in merge(...) because the default is to merge on all common columns (which, in your case, are subject and day). I included it for clarity.
  2. A plain merge(...) produces extra rows because some of the rows in df2 are duplicated: every duplicate matches each df1 row with the same subject and day. In this case we can deal with that using unique(...), but usually this is a symptom of a bigger problem. You should really look into why there are duplicated rows.
  3. The way you have it set up, dtime is a factor (data.frame(...) converts strings to factors by default in R versions before 4.0.0). You have to convert it to character before you can set the NAs to something else.
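To see the effect described in note 2, it can help to count rows before and after de-duplicating. The following is only a sketch on a hypothetical reduced subset of the question's data (df1_small and df2_small are names made up here, not part of the original post):

```r
# Hypothetical reduced subset of the question's data (assumption: four and
# three rows are enough to show the effect).
df1_small <- data.frame(subject = c(1, 1, 1, 1),
                        day     = c(0, 1, 1, 2),
                        stime   = c("4/16/2012 6:25", "4/17/2012 7:22",
                                    "4/17/2012 7:45", "4/18/2012 6:50"))
df2_small <- data.frame(subject = c(1, 1, 1),
                        day     = c(1, 1, 2),
                        dtime   = c("4/17/2012 7:15", "4/17/2012 7:15",
                                    "4/17/2012 7:15"))

# Rows of df2_small that exactly repeat an earlier row
dup_rows <- df2_small[duplicated(df2_small), ]

# Row counts of the left join without and with de-duplication
n_plain  <- nrow(merge(df1_small, df2_small,
                       by = c("subject", "day"), all.x = TRUE))
n_unique <- nrow(merge(df1_small, unique(df2_small),
                       by = c("subject", "day"), all.x = TRUE))

# After unique(), each (subject, day) key should appear at most once;
# otherwise the join can still multiply rows.
keys_ok <- anyDuplicated(unique(df2_small)[, c("subject", "day")]) == 0

print(c(df1_rows = nrow(df1_small), plain = n_plain, deduplicated = n_unique))
```

If n_unique still differs from nrow(df1_small), or keys_ok is FALSE, then df2 has several different dtime values for the same subject and day, and unique(...) alone will not fix the extra rows.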

Finally, if your datasets are indeed large (millions of rows), consider using the data.table package. Keyed joins there are usually much faster than merge(...) on data frames.

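library(data.table)
# key both tables on the join columns (this also sorts them by subject, day)
dt1 <- data.table(df1, key = c("subject", "day"))
dt2 <- data.table(unique(df2), key = c("subject", "day"))
# for every row of dt1, look up the matching row of dt2 (NA where there is none)
result <- dt2[dt1]
result[, dtime := as.character(dtime)]   # only changes anything if dtime is a factor
result[is.na(dtime), dtime := "."]
head(result)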
#    subject day          dtime          stime
# 1:       1   0              . 4/16/2012 6:25
# 2:       1   0              . 4/16/2012 7:01
# 3:       1   1 4/17/2012 7:15 4/17/2012 7:22
# 4:       1   1 4/17/2012 7:15 4/17/2012 7:45
# 5:       1   1 4/17/2012 7:15 4/17/2012 8:13
# 6:       1   2 4/17/2012 7:15 4/18/2012 6:50
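Both merge(...) (with its default sort = TRUE) and a keyed data.table join reorder the rows by subject and day. That is harmless here because df1 is already sorted that way. If df1's original row order has to be preserved exactly, one alternative is a lookup with match(...) in base R. This is a different technique from the two above, sketched on hypothetical reduced data (df1_small, df2_small and the helper variables are made-up names):

```r
# Hypothetical reduced data; df1_small is deliberately NOT sorted by subject.
df1_small <- data.frame(subject = c(2, 1, 1, 1),
                        day     = c(1, 0, 1, 2),
                        stime   = c("4/24/2012 6:55", "4/16/2012 6:25",
                                    "4/17/2012 7:22", "4/18/2012 6:50"))
df2_small <- data.frame(subject = c(1, 1, 1, 2),
                        day     = c(1, 1, 2, 1),
                        dtime   = c("4/17/2012 7:15", "4/17/2012 7:15",
                                    "4/17/2012 7:15", "4/24/2012 6:45"))

# De-duplicated lookup table, as in the merge() version
lookup <- unique(df2_small)

# Build one combined key per row from the two join columns
key1 <- paste(df1_small$subject, df1_small$day)
key2 <- paste(lookup$subject, lookup$day)

# Position of each df1 key in the lookup table; NA where there is no match
idx <- match(key1, key2)

# Copy df1 and attach dtime row by row, so the row order never changes
merged_small <- df1_small
merged_small$dtime <- as.character(lookup$dtime)[idx]
merged_small$dtime[is.na(merged_small$dtime)] <- "."

print(merged_small)
```

Note that match(...) returns only the first hit, so if the lookup table holds several different dtime values for one subject and day, this silently picks the first instead of adding rows.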


 