Merging overlapping dataframes in R

Okay, so I have two different data frames (df1 and df2) which, to simplify, each have an ID, a date, and the score on a test. In each data frame, a person (ID) has taken the test on multiple dates. Some people are listed in df1 but not in df2, and vice versa, while others are listed in both, and their test dates can overlap in different ways.

I want to combine all the data into one frame. The tricky part is that if a row from df1 and a row from df2 have the same ID and their dates are within 7 days of each other (I can check this with a subtracted-dates column), I want to combine them into a single row.

In essence, for every ID, two tests taken within 7 days of each other should end up in one row with both scores in separate columns. Otherwise they should be two separate rows, one with the score from df1 and one with the score from df2, alongside all the other scores that are not listed in both.

Example:

df1

ID Date1(yyyymmdd) Score1
1  20140512        50
1  20140501        30
1  20140703        50
1  20140805        20
3  20140522        70
3  20140530        10

df2

ID Date2(yyyymmdd) Score2
1  20140530        40
1  20140622        20
1  20140702        10
1  20140820        60
2  20140522        30
2  20140530        80

Wanted_df

ID Date1(yyyymmdd) Score1 Date2(yyyymmdd) Score2
1  20140512        50                     
1  20140501        30
1  20140703        50     20140702        10
1  20140805        20
1                         20140530        40
1                         20140622        20
1                         20140820        60
3  20140522        70
3  20140530        10
2                         20140522        30
2                         20140530        80

Use a full outer join with a limit on the absolute value of the date difference. (A full outer join of A and B keeps all rows of both A and B.) For example:

library(sqldf)
# FULL OUTER JOIN requires SQLite 3.39.0 or later in sqldf's default backend
sqldf("select a.*, b.* from df1 a full outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <= 7")

Note that your date variables will have to be true dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d"), and likewise for df2$Date2.
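As a quick check of that conversion, here is a minimal sketch using three dates from the question's df1. The vector name raw_dates is made up for illustration.

```r
# Hypothetical vector of yyyymmdd integers, taken from the question's df1
raw_dates <- c(20140512L, 20140501L, 20140703L)

# %Y = 4-digit year, %m = month, %d = day of month
# (%M would mean minutes, which is why the case of the letters matters)
dates <- as.Date(as.character(raw_dates), format = "%Y%m%d")

class(dates)                       # "Date"
as.numeric(dates[1] - dates[2])    # 11 (days between 2014-05-12 and 2014-05-01)
```

Once the columns are Date objects, subtracting them gives a difference in days, which is what the 7-day comparison needs.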

Alright. I feel bad about the outer join answer above, which did not work when I posted it (it may be possible in a library I don't know about, and there are advantages to using an RDBMS sometimes), so here is a hacky workaround. It assumes that all the joins will be at most one to one, which you've said is OK.

# ensure the date columns are date type
df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format="%Y%m%d")

# ensure the dfs are sorted 
df1 <- df1[order(df1$ID, df1$Date1),]
df2 <- df2[order(df2$ID, df2$Date2),]

# initialize the output df3, which starts as everything from df1 and NA from df2
# Date2 is given a Date-typed NA so the placeholder column has the right class
df3 <- cbind(df1, Date2=as.Date(NA), Score2=NA)

library(plyr) #for rbind.fill

for (j in seq_len(nrow(df2))) {
  # see if there are any rows of df3 (originally from df1) that row j of df2
  # could join to: same ID and dates fewer than 7 days apart
  join_rows <- which(df3[,"ID"]==df2[j,"ID"] & abs(df3[,"Date1"]-df2[j,"Date2"])<7 )
  if (length(join_rows)>0) {
    # if so, join it to the first one (see discussion)
    df3[min(join_rows),"Date2"] <- df2[j,"Date2"]
    df3[min(join_rows),"Score2"] <- df2[j,"Score2"]
  } else {
    # if not, add a new row holding just the df2 values
    df3 <- rbind.fill(df3,df2[j,])
  }
}
df3 <- df3[order(df3$ID,df3$Date1,df3$Date2),]
row.names(df3)<-NULL # i hate these
df3  
#    ID      Date1 Score1      Date2 Score2
# 1   1 2014-05-01     30       <NA>     NA
# 2   1 2014-05-12     50       <NA>     NA
# 3   1 2014-07-03     50 2014-07-02     10
# 4   1 2014-08-05     20       <NA>     NA
# 5   1       <NA>     NA 2014-05-30     40
# 6   1       <NA>     NA 2014-06-22     20
# 7   1       <NA>     NA 2014-08-20     60
# 8   2       <NA>     NA 2014-05-22     30
# 9   2       <NA>     NA 2014-05-30     80
# 10  3 2014-05-22     70       <NA>     NA
# 11  3 2014-05-30     10       <NA>     NA

I couldn't get the rows in the same sort order as yours, but they look the same.

Short explanation: For each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.

Why this is a bad hacky workaround: for large data sets, growing df3 with rbind.fill inside the loop copies the data frame over and over and will consume more and more memory. The loop is definitely not optimal, and its search does not exploit the fact that the tables are sorted. If by some chance the test were taken twice within a week, you would see some unexpected behavior (a df2 row overwriting an earlier match, duplicates from df2, etc).
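For comparison, here is a rough base R sketch that avoids growing a data frame inside a loop. It is a different technique from the loop above, not a drop-in replacement: it pairs every df1 row with every df2 row of the same ID using merge(), keeps the pairs at most 7 days apart (<= 7, where the loop uses < 7), and then appends the rows that found no partner. The helper columns row1 and row2 are names I made up for bookkeeping. If a test were taken twice within a week, every qualifying pair would get its own row.

```r
# Example data from the question
df1 <- data.frame(ID = c(1, 1, 1, 1, 3, 3),
                  Date1 = c(20140512, 20140501, 20140703, 20140805, 20140522, 20140530),
                  Score1 = c(50, 30, 50, 20, 70, 10))
df2 <- data.frame(ID = c(1, 1, 1, 1, 2, 2),
                  Date2 = c(20140530, 20140622, 20140702, 20140820, 20140522, 20140530),
                  Score2 = c(40, 20, 10, 60, 30, 80))
df1$Date1 <- as.Date(as.character(df1$Date1), format = "%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format = "%Y%m%d")

# Tag each row so unmatched rows can be identified later (assumed helper names)
df1$row1 <- seq_len(nrow(df1))
df2$row2 <- seq_len(nrow(df2))

# All df1/df2 combinations that share an ID, then keep pairs at most 7 days apart
pairs <- merge(df1, df2, by = "ID")
pairs <- pairs[abs(as.numeric(pairs$Date1 - pairs$Date2)) <= 7, ]

# Rows that did not take part in any pair
only1 <- df1[!df1$row1 %in% pairs$row1, ]
only2 <- df2[!df2$row2 %in% pairs$row2, ]

# Add typed NA placeholders for the missing side (rep() also copes with zero rows)
only1$Date2  <- rep(as.Date(NA), nrow(only1))
only1$Score2 <- rep(NA_real_, nrow(only1))
only2$Date1  <- rep(as.Date(NA), nrow(only2))
only2$Score1 <- rep(NA_real_, nrow(only2))

# Stack the three pieces in a common column order and sort
cols <- c("ID", "Date1", "Score1", "Date2", "Score2")
result <- rbind(pairs[, cols], only1[, cols], only2[, cols])
result <- result[order(result$ID, result$Date1, result$Date2), ]
row.names(result) <- NULL
result
```

The merge() step still builds every same-ID combination before filtering, so it can get large when individual IDs have many tests.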
