Merge two data sets based on the information from two columns

Question

I have two large data sets like these:

df1 <- data.frame(subject = c(rep(1, 15), rep(2, 14)), day =c(0,0,1,1,1,2,3,15,15,16,16,17,17,18,19,0,0,1,1,2,3,15,15,16,16,17,17,18,19),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/17/2012 7:22','4/17/2012 7:45','4/17/2012 8:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/2/2012 6:28','5/2/2012 7:00','5/3/2012 6:22','5/3/2012 7:00','5/4/2012 6:26','5/5/2012 6:47','4/23/2012 5:56','4/23/2012 6:30','4/24/2012 6:55','4/24/2012 7:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/9/2012 5:55','5/9/2012 6:30','5/10/2012 5:55','5/10/2012 6:30','5/11/2012 6:41','5/12/2012 6:46'))

df2 <- data.frame(subject = c(rep(1, 10), rep(2, 10)), day =c(1,1,2,3,9,12,15,15,16,17,1,1,2,3,9,13,15,15,16,17),dtime=c('4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/25/2012 7:15','4/28/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','5/2/2012 7:00','5/6/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45'))

...

I want to merge the two data sets so that 'dtime' from df2 is matched to df1 by 'subject' and 'day', with missing values filled with '.'. The result should have the same number of rows as df1. The expected output should look like this:

merged <- data.frame(subject = c(rep(1, 15), rep(2, 14)), day =c(0,0,1,1,1,2,3,15,15,16,16,17,17,18,19,0,0,1,1,2,3,15,15,16,16,17,17,18,19),stime=c('4/16/2012 6:25','4/16/2012 7:01','4/17/2012 7:22','4/17/2012 7:45','4/17/2012 8:13','4/18/2012 6:50','4/19/2012 6:55','5/1/2012 6:28','5/1/2012 7:00','5/2/2012 6:28','5/2/2012 7:00','5/3/2012 6:22','5/3/2012 7:00','5/4/2012 6:26','5/5/2012 6:47','4/23/2012 5:56','4/23/2012 6:30','4/24/2012 6:55','4/24/2012 7:20','4/25/2012 6:32','4/26/2012 6:28','5/8/2012 5:54','5/8/2012 6:30','5/9/2012 5:55','5/9/2012 6:30','5/10/2012 5:55','5/10/2012 6:30','5/11/2012 6:41','5/12/2012 6:46'),dtime =c('.','.','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','4/17/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','5/1/2012 7:15','.','.','.','.','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','4/24/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','5/8/2012 6:45','.','.'))

...

I tried merge(df1, df2, by = c('subject', 'day')), but it does not work well: it produces extra rows that I do not want.

Does anyone have an idea how to achieve this?

Answer 1

This seems to work.

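# Left join: keep every row of df1; unique() drops the duplicated rows in df2
result <- merge(df1, unique(df2), by = c("subject", "day"), all.x = TRUE)
# dtime may be a factor, so convert it to character before assigning "."
result$dtime <- as.character(result$dtime)
result$dtime[is.na(result$dtime)] <- "."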

Some notes:

You don't need the by=... argument in merge(...) because the default is to merge on all common columns (which, in your case, are subject and day). I included it for clarity.
The other answer produced extra rows because some of the rows in df2 are duplicated. In this case we can deal with that using unique(...), but usually this is a symptom of a bigger problem. You should really look into why there are duplicated rows.
The way you have it set up, dtime is a factor (data.frame(...) converts strings to factors by default in R versions before 4.0.0). You have to convert it to character before you can set the NAs to something else.
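
To illustrate the note about duplicated rows, here is a small sketch on made-up toy frames (a and b are hypothetical stand-ins, not your data). It shows how a duplicated key row in the right-hand table multiplies rows in the merge, and how duplicated(...) can be used to find such rows:

```r
# Toy data, hypothetical and for illustration only
a <- data.frame(subject = c(1, 1, 2), day = c(1, 1, 2),
                stime = c("s1", "s2", "s3"), stringsAsFactors = FALSE)
b <- data.frame(subject = c(1, 1, 2), day = c(1, 1, 2),
                dtime = c("d1", "d1", "d2"), stringsAsFactors = FALSE)

# Each row of a is paired with every row of b that has the same key,
# so the repeated (1, 1) row in b doubles the matches for that key.
n_raw <- nrow(merge(a, b, by = c("subject", "day"), all.x = TRUE))

# Dropping exact duplicate rows first gives one output row per row of a.
n_dedup <- nrow(merge(a, unique(b), by = c("subject", "day"), all.x = TRUE))

# Rows of b that repeat an earlier row exactly
dup_rows <- b[duplicated(b), ]

print(c(raw = n_raw, dedup = n_dedup, duplicates = nrow(dup_rows)))
```

Note that unique(...) only removes rows that are identical in every column. If the same subject/day pair appeared with two different dtime values, the merge would still return extra rows; duplicated(b[c("subject", "day")]) would flag that case.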

Finally, if your datasets are indeed large (millions of rows), then consider using the data.table package. This will be much faster.

library(data.table)
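# Keyed tables; unique() again removes the duplicated rows in df2
dt1 <- data.table(df1, key = c("subject", "day"))
dt2 <- data.table(unique(df2), key = c("subject", "day"))
# dt2[dt1] looks up each row of dt1 in dt2, keeping every row of dt1
result <- dt2[dt1]
result[is.na(dtime), dtime := "."]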
head(result)
#    subject day          dtime          stime
# 1:       1   0              . 4/16/2012 6:25
# 2:       1   0              . 4/16/2012 7:01
# 3:       1   1 4/17/2012 7:15 4/17/2012 7:22
# 4:       1   1 4/17/2012 7:15 4/17/2012 7:45
# 5:       1   1 4/17/2012 7:15 4/17/2012 8:13
# 6:       1   2 4/17/2012 7:15 4/18/2012 6:50
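
Since the question asks for the rows to stay exactly as they are in df1, one alternative worth mentioning is a lookup with match(...) on a combined key, which never reorders or adds rows. This is only a sketch on hypothetical toy frames (a, b, and the "_" separator are assumptions for illustration), not a replacement for the code above:

```r
# Toy data, hypothetical; a is deliberately not sorted by subject/day
a <- data.frame(subject = c(2, 1, 1), day = c(0, 1, 5),
                stime = c("s1", "s2", "s3"), stringsAsFactors = FALSE)
b <- data.frame(subject = c(1, 1, 2), day = c(1, 1, 3),
                dtime = c("d1", "d1", "d2"), stringsAsFactors = FALSE)

# Build one key per row from the two join columns
key_a <- paste(a$subject, a$day, sep = "_")
key_b <- paste(b$subject, b$day, sep = "_")

# match() gives the position of the first matching key in b, or NA if none,
# so duplicated keys in b cannot create extra rows
idx <- match(key_a, key_b)
a$dtime <- b$dtime[idx]

# Fill unmatched rows with "."
a$dtime[is.na(a$dtime)] <- "."
print(a)
```

Because match(...) returns only the first hit, this silently picks one dtime if a key maps to several different values in b, so checking for conflicting duplicates beforehand is still advisable.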

solution1
2 ACCEPTED 2014-04-05 21:21:08