
Merge dataframes on matching A, B and *closest* C?

I have two dataframes like so:

set.seed(1)
df <- cbind(expand.grid(x=1:3, y=1:5), time=round(runif(15)*30))
to.merge <- data.frame(x=c(2, 2, 2, 3, 2),
                       y=c(1, 1, 1, 5, 4),
                       time=c(17, 12, 11.6, 22.5, 2),
                       val=letters[1:5],
                       stringsAsFactors=F)

I want to merge to.merge into df (with all.x=T) such that:

  • df$x == to.merge$x AND
  • df$y == to.merge$y AND
  • abs(df$time - to.merge$time) <= 1; if multiple rows of to.merge satisfy this, we pick the one that minimises this distance.

How can I do this?

So my desired result is (this is just df with the corresponding val column of to.merge added for matching rows):

   x y time val
1  1 1    8  NA
2  2 1   11   c
3  3 1   17  NA
4  1 2   27  NA
5  2 2    6  NA
6  3 2   27  NA
7  1 3   28  NA
8  2 3   20  NA
9  3 3   19  NA
10 1 4    2  NA
11 2 4    6  NA
12 3 4    5  NA
13 1 5   21  NA
14 2 5   12  NA
15 3 5   23   d

where to.merge was:

  x y time val
1 2 1 17.0   a
2 2 1 12.0   b
3 2 1 11.6   c
4 3 5 22.5   d
5 2 4  2.0   e

Note - (2, 1, 17, a) didn't match into df because the time 17 was more than 1 away from df$time 11 for (x, y) = (2, 1).

Also, there were two rows in to.merge that satisfied the condition for matching to df's (2, 1, 11) row, but the 'c' row was picked instead of the 'b' row because its time was the closest to 11.

Finally, there may be rows in to.merge that do not match anything in df.
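To make the matching rule concrete, here is a minimal base R sketch that applies it to the single df row (2, 1, 11). The names target, cand, within1 and picked are made up for illustration and are not part of the question.

```r
to.merge <- data.frame(x = c(2, 2, 2, 3, 2),
                       y = c(1, 1, 1, 5, 4),
                       time = c(17, 12, 11.6, 22.5, 2),
                       val = letters[1:5],
                       stringsAsFactors = FALSE)

# The df row being matched: (x, y, time) = (2, 1, 11)
target <- list(x = 2, y = 1, time = 11)

# Step 1: keep only rows of to.merge with the same (x, y)
cand <- to.merge[to.merge$x == target$x & to.merge$y == target$y, ]

# Step 2: absolute time difference, keep those within 1
cand$dist <- abs(cand$time - target$time)
within1 <- cand[cand$dist <= 1, ]

# Step 3: among the survivors, pick the smallest distance
picked <- within1$val[which.min(within1$dist)]

print(within1)
print(picked)
```

Rows 'b' and 'c' survive the distance filter ('a' is 6 away), and 'c' is picked because 11.6 is closer to 11 than 12 is.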


One way that works is a for-loop, but it takes far too long for my data (df has ~12k rows and to.merge has ~250k rows):

df$val <- NA
for (i in seq_len(nrow(df))) {
    row <- df[i, ]
    idx <- which(row$x == to.merge$x &
                 row$y == to.merge$y &
                 abs(row$time - to.merge$time) <= 1)
    if (length(idx)) {
        # pick the candidate with the smallest absolute time difference
        j <- idx[which.min(abs(row$time - to.merge$time[idx]))]
        df$val[i] <- to.merge$val[j]
    }
}

I feel that I can somehow do a merge, like:

to.merge$closest_time_in_df <- sapply(to.merge$time,
                                  function (tm) {
                                     dts <- abs(tm - df$time)
                                     # difference must be at most 1
                                     if (min(dts) <= 1) {
                                         df$time[which.min(dts)]
                                     } else {
                                         NA
                                     }
                                  })
merge(df, to.merge,
      by.x=c('x', 'y', 'time'),
      by.y=c('x', 'y', 'closest_time_in_df'),
      all.x=T)

But this doesn't merge the (2, 1, 11) row, because to.merge$closest_time_in_df for (2, 1, 11.6, c) is 12, and a time of 12 in df corresponds to (x, y) = (2, 5), not (2, 1). Hence the merge fails.
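The failure can be reproduced in isolation with a small base R sketch. The time values are hard-coded from the desired-result table above rather than regenerated with runif, and the names nearest_global and nearest_in_group are hypothetical.

```r
# df as shown in the desired result, with time hard-coded
df <- cbind(expand.grid(x = 1:3, y = 1:5),
            time = c(8, 11, 17, 27, 6, 27, 28, 20, 19, 2, 6, 5, 21, 12, 23))

tm <- 11.6  # time of the (2, 1, 11.6, c) row in to.merge

# Nearest df$time over ALL rows, ignoring x and y (what the sapply does)
nearest_global <- df[which.min(abs(tm - df$time)), ]
print(nearest_global)

# Nearest df$time restricted to rows with the same (x, y) = (2, 1)
grp <- df[df$x == 2 & df$y == 1, ]
nearest_in_group <- grp[which.min(abs(tm - grp$time)), ]
print(nearest_in_group)
```

The global search lands on the (2, 5, 12) row, while the search restricted to (x, y) = (2, 1) lands on the intended (2, 1, 11) row. The nearest-time lookup therefore has to be done within each (x, y) group.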

Use data.table with roll='nearest' or, to limit the distance to 1, roll = 1, rollends = c(TRUE,TRUE)

e.g.

library(data.table)
# create data.tables with the same key columns (x, y, time)
DT <- data.table(df, key = names(df))
tm <- data.table(to.merge, key = key(DT))

# use join syntax with roll = 'nearest'


tm[DT, roll='nearest']

#     x y time val
#  1: 1 1    8  NA
#  2: 1 2   27  NA
#  3: 1 3   28  NA
#  4: 1 4    2  NA
#  5: 1 5   21  NA
#  6: 2 1   11   c
#  7: 2 2    6  NA
#  8: 2 3   20  NA
#  9: 2 4    6   e
# 10: 2 5   12  NA
# 11: 3 1   17  NA
# 12: 3 2   27  NA
# 13: 3 3   19  NA
# 14: 3 4    5  NA
# 15: 3 5   23   d

You can limit yourself to looking forward and back by at most 1 by setting roll=-1 and rollends = c(TRUE,TRUE)

new <- tm[DT, roll=-1, rollends  =c(TRUE,TRUE)]
new
    x y time val
 1: 1 1    8  NA
 2: 1 2   27  NA
 3: 1 3   28  NA
 4: 1 4    2  NA
 5: 1 5   21  NA
 6: 2 1   11   c
 7: 2 2    6  NA
 8: 2 3   20  NA
 9: 2 4    6  NA
10: 2 5   12  NA
11: 3 1   17  NA
12: 3 2   27  NA
13: 3 3   19  NA
14: 3 4    5  NA
15: 3 5   23   d

Or you can use roll=1 first, then roll=-1, and then combine the results (tidying up the val.1 column from the second rolling join)

new <- tm[DT, roll = 1][tm[DT,roll=-1]][is.na(val), val := ifelse(is.na(val.1),val,val.1)][,val.1 := NULL]
new
    x y time val
 1: 1 1    8  NA
 2: 1 2   27  NA
 3: 1 3   28  NA
 4: 1 4    2  NA
 5: 1 5   21  NA
 6: 2 1   11   c
 7: 2 2    6  NA
 8: 2 3   20  NA
 9: 2 4    6  NA
10: 2 5   12  NA
11: 3 1   17  NA
12: 3 2   27  NA
13: 3 3   19  NA
14: 3 4    5  NA
15: 3 5   23   d

Here is how to do it using merge a couple of times and aggregate once.

set.seed(1)
df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = F)

#Find rows that match by x and y
res <- merge(to.merge, df, by = c("x", "y"), all.x = TRUE)
res$dif <- abs(res$time.x - res$time.y)
res
##   x y time.x val time.y dif
## 1 2 1   17.0   a     11 6.0
## 2 2 1   12.0   b     11 1.0
## 3 2 1   11.6   c     11 0.6
## 4 2 4    2.0   e      6 4.0
## 5 3 5   22.5   d     23 0.5

#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ x + y, data = res, FUN = min), res)
res1
##   x y dif time.x val time.y
## 1 2 1 0.6   11.6   c     11
## 2 2 4 4.0    2.0   e      6
## 3 3 5 0.5   22.5   d     23

#Finally merge the result back into df
final <- merge(df, res1[res1$dif <= 1, c("x", "y", "val")], all.x = TRUE)
final
##    x y time  val
## 1  1 1    8 <NA>
## 2  1 2   27 <NA>
## 3  1 3   28 <NA>
## 4  1 4    2 <NA>
## 5  1 5   21 <NA>
## 6  2 1   11    c
## 7  2 2    6 <NA>
## 8  2 3   20 <NA>
## 9  2 4    6 <NA>
## 10 2 5   12 <NA>
## 11 3 1   17 <NA>
## 12 3 2   27 <NA>
## 13 3 3   19 <NA>
## 14 3 4    5 <NA>
## 15 3 5   23    d

mnel's answer uses roll = "nearest" in a data.table join but does not limit to +/- 1 as requested by the OP. In addition, MichaelChirico has suggested using the on parameter.

This approach uses

  • roll = "nearest",
  • an update by reference, i.e., without copying,
  • setDT() to coerce a data.frame to data.table without copying (introduced 2014-02-27 with v.1.9.2 of data.table),
  • the on parameter, which saves setting a key explicitly (introduced 2015-09-19 with v.1.9.6).

So, the code below

library(data.table)   # version 1.11.4 used
setDT(df)[setDT(to.merge), on  = .(x, y, time), roll = "nearest",
          val := replace(val, abs(x.time - i.time) > 1, NA)]
df

has updated df:

    x y time  val
 1: 1 1    8 <NA>
 2: 2 1   11    c
 3: 3 1   17 <NA>
 4: 1 2   27 <NA>
 5: 2 2    6 <NA>
 6: 3 2   27 <NA>
 7: 1 3   28 <NA>
 8: 2 3   20 <NA>
 9: 3 3   19 <NA>
10: 1 4    2 <NA>
11: 2 4    6 <NA>
12: 3 4    5 <NA>
13: 1 5   21 <NA>
14: 2 5   12 <NA>
15: 3 5   23    d

Note that the order of rows has not been changed (in contrast to Chinmay Patil's answer).

In case df must not be changed, a new data.table can be created by

result <- setDT(to.merge)[setDT(df), on  = .(x, y, time), roll = "nearest",
                .(x, y, time, val = replace(val, abs(x.time - i.time) > 1, NA))]
result

which returns the same result as above.
