
Merging overlapping dataframes in R

Okay, so I have two different data frames (df1 and df2) which, to simplify, each have an ID, a date, and the score on a test. In each data frame a person (ID) has taken the test on multiple dates. Comparing the two data frames, some people are listed in df1 but not in df2, and vice versa, while some are listed in both, and their dates can overlap in different ways.

I want to combine all the data into one frame, but the tricky part is that if, for the same ID, a score from df1 and a score from df2 were taken within 7 days of each other (I can check this with a subtracted-dates column), I want to combine them into one row.

In essence, for every ID there will be one row holding both scores in separate columns when the two tests were taken within 7 days of each other; otherwise there will be two separate rows, one with the score from df1 and one with the score from df2, along with all the other scores that appear in only one of the frames.

Example:

df1

ID Date1(yyyymmdd) Score1
1  20140512        50
1  20140501        30
1  20140703        50
1  20140805        20
3  20140522        70
3  20140530        10

df2

ID Date2(yyyymmdd) Score2
1  20140530        40
1  20140622        20
1  20140702        10
1  20140820        60
2  20140522        30
2  20140530        80

Wanted_df

ID Date1(yyyymmdd) Score1 Date2(yyyymmdd) Score2
1  20140512        50                     
1  20140501        30
1  20140703        50     20140702        10
1  20140805        20
1                         20140530        40
1                         20140622        20
1                         20140820        60
3  20140522        70
3  20140530        10
2                         20140522        30
2                         20140530        80

Use an outer join with an absolute-value limit on the date difference. (A outer join B keeps all rows of A and B.) For example:

library(sqldf)
sqldf("select a.*, b.* from df1 a outer join df2 b on a.ID = b.ID and abs(a.Date1 - b.Date2) <=7")

Note that your date variables will have to be true dates. If they are currently characters or integers, you need to do something like df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d") etc.
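To see why the conversion matters, here is a small sketch. The two sample dates and the variable names are made up for illustration. It compares subtracting the raw yyyymmdd integers with subtracting real Date values across a month boundary: the integer difference comes out as 70, while the true gap is 1 day.

```r
# Two test dates one day apart, stored as yyyymmdd integers (illustrative values)
raw <- c(20140731L, 20140801L)

# Subtracting the raw integers is misleading across a month boundary
int_diff <- raw[2] - raw[1]

# Convert to Date first: %Y = 4-digit year, %m = month, %d = day of month
dates <- as.Date(as.character(raw), format = "%Y%m%d")
day_diff <- as.numeric(dates[2] - dates[1])

print(int_diff)            # difference of the raw integers
print(day_diff)            # difference in days
print(abs(day_diff) <= 7)  # the "within 7 days" check on real dates
```

Note the lowercase %m and %d: in R's date formats, %M means minutes and %D is a shorthand for %m/%d/%y, so neither parses a yyyymmdd string.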

Alright. I feel bad about the bogus outer join answer (it may be possible in a library I don't know about, but sometimes there are advantages to using an RDBMS...), so here is a hacky workaround. It assumes that all the joins will be at most one to one, which you've said is OK.

# ensure the date columns are date type
df1$Date1 <- as.Date(as.character(df1$Date1), format="%Y%m%d")
df2$Date2 <- as.Date(as.character(df2$Date2), format="%Y%m%d")

# ensure the dfs are sorted 
df1 <- df1[order(df1$ID, df1$Date1),]
df2 <- df2[order(df2$ID, df2$Date2),]

# initialize the output df3, which starts as everything from df1 and NA from df2
# (Date2 is created as a Date-typed NA so dates assigned later keep their class)
df3 <- cbind(df1,Date2=as.Date(NA), Score2=NA)

library(plyr) #for rbind.fill

for (j in seq_len(nrow(df2))){
  # see if there are any rows of df3 (the df1 rows) you could join this df2 row to
  # (strictly less than 7 days apart; use <=7 to include exactly 7 days)
  join_rows <- which(df3[,"ID"]==df2[j,"ID"] & abs(df3[,"Date1"]-df2[j,"Date2"])<7 )
  # if so, join it to the first one (see discussion)
  if(length(join_rows)>0){
    df3[min(join_rows),"Date2"] <- df2[j,"Date2"]
    df3[min(join_rows),"Score2"] <- df2[j,"Score2"]
  } else {
    # if not, add a new row holding just the df2 values
    df3 <- rbind.fill(df3,df2[j,])
  }
}
df3 <- df3[order(df3$ID,df3$Date1,df3$Date2),]
row.names(df3)<-NULL # i hate these
df3  
#    ID      Date1 Score1      Date2 Score2
# 1   1 2014-05-01     30       <NA>     NA
# 2   1 2014-05-12     50       <NA>     NA
# 3   1 2014-07-03     50 2014-07-02     10
# 4   1 2014-08-05     20       <NA>     NA
# 5   1       <NA>     NA 2014-05-30     40
# 6   1       <NA>     NA 2014-06-22     20
# 7   1       <NA>     NA 2014-08-20     60
# 8   2       <NA>     NA 2014-05-22     30
# 9   2       <NA>     NA 2014-05-30     80
# 10  3 2014-05-22     70       <NA>     NA
# 11  3 2014-05-30     10       <NA>     NA

I couldn't get the rows in the same sort order as yours, but the contents are the same.

Short explanation: For each row in df2, see if there's a row in df1 you can "join" it to. If not, stick it at the bottom of the table. In the initialization and rbinding, you'll see some hacky ways of assigning blank rows or columns as placeholders.

Why this is a bad hacky workaround: for large data sets, repeatedly rbinding new rows onto df3 will consume more and more memory. The loop is definitely not optimal, and its search does not exploit the fact that the tables are sorted. If by some chance the test were taken twice within a week, you would see some unexpected behavior (duplicates from df2, etc.).
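For what it's worth, the same result can probably be reached without growing df3 inside a loop. This is only a sketch in base R, not the approach used above. It pairs every df1 row with every df2 row of the same ID via merge(), keeps pairs at most 7 days apart, and then appends the rows that found no partner. The helper columns k1 and k2 and the names pairs, only1, only2 and result are made up for illustration, and it still assumes matches are at most one to one (otherwise a row would appear once per match).

```r
# Rebuild the example data with real Date columns
df1 <- data.frame(
  ID = c(1, 1, 1, 1, 3, 3),
  Date1 = as.Date(c("20140512", "20140501", "20140703",
                    "20140805", "20140522", "20140530"), format = "%Y%m%d"),
  Score1 = c(50, 30, 50, 20, 70, 10)
)
df2 <- data.frame(
  ID = c(1, 1, 1, 1, 2, 2),
  Date2 = as.Date(c("20140530", "20140622", "20140702",
                    "20140820", "20140522", "20140530"), format = "%Y%m%d"),
  Score2 = c(40, 20, 10, 60, 30, 80)
)

# Hypothetical row keys so matched rows can be identified afterwards
df1$k1 <- seq_len(nrow(df1))
df2$k2 <- seq_len(nrow(df2))

# All same-ID combinations, then keep those at most 7 days apart
pairs <- merge(df1, df2, by = "ID")
pairs <- pairs[abs(as.numeric(pairs$Date1 - pairs$Date2)) <= 7, ]

# Rows that found no partner, padded with typed NAs for the other test
u1 <- df1[!df1$k1 %in% pairs$k1, ]
u2 <- df2[!df2$k2 %in% pairs$k2, ]
only1 <- data.frame(ID = u1$ID, Date1 = u1$Date1, Score1 = u1$Score1,
                    Date2 = rep(as.Date(NA), nrow(u1)),
                    Score2 = rep(NA_real_, nrow(u1)))
only2 <- data.frame(ID = u2$ID,
                    Date1 = rep(as.Date(NA), nrow(u2)),
                    Score1 = rep(NA_real_, nrow(u2)),
                    Date2 = u2$Date2, Score2 = u2$Score2)

# Stack the three pieces and sort like the output above
cols <- c("ID", "Date1", "Score1", "Date2", "Score2")
result <- rbind(pairs[, cols], only1, only2)
result <- result[order(result$ID, result$Date1, result$Date2), ]
row.names(result) <- NULL
print(result)
```

The trade-off is memory of a different kind: merge() materializes every same-ID combination before filtering, so an ID with many tests in both frames produces a large intermediate table.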
