Merge dataframes on matching A, B and *closest* C?
I have two dataframes like so:
set.seed(1)
df <- cbind(expand.grid(x=1:3, y=1:5), time=round(runif(15)*30))
to.merge <- data.frame(x=c(2, 2, 2, 3, 2),
                       y=c(1, 1, 1, 5, 4),
                       time=c(17, 12, 11.6, 22.5, 2),
                       val=letters[1:5],
                       stringsAsFactors=FALSE)
I want to merge to.merge into df (with all.x=T) such that:
df$x == to.merge$x AND
df$y == to.merge$y AND
abs(df$time - to.merge$time) <= 1;
in the case of multiple to.merge rows that satisfy these conditions, we pick the one that minimises this distance.
How can I do this?
So my desired result is (this is just df with the corresponding val column of to.merge added for matching rows):
x y time val
1 1 1 8 NA
2 2 1 11 c
3 3 1 17 NA
4 1 2 27 NA
5 2 2 6 NA
6 3 2 27 NA
7 1 3 28 NA
8 2 3 20 NA
9 3 3 19 NA
10 1 4 2 NA
11 2 4 6 NA
12 3 4 5 NA
13 1 5 21 NA
14 2 5 12 NA
15 3 5 23 d
where to.merge was:
x y time val
1 2 1 17.0 a
2 2 1 12.0 b
3 2 1 11.6 c
4 3 5 22.5 d
5 2 4 2.0 e
Note: (2, 1, 17, a) didn't match into df because the time 17 was more than 1 away from the df$time of 11 for (x, y) = (2, 1).
Also, there were two rows in to.merge that satisfied the condition for matching to df's (2, 1, 11) row, but the 'c' row was picked instead of the 'b' row because its time was the closest to 11.
Finally, there may be rows in to.merge that do not match anything in df.
One way that works is a for-loop, but it takes far too long for my data (df has ~12k rows and to.merge has ~250k rows):
df$val <- NA
for (i in seq_len(nrow(df))) {
  row <- df[i, ]
  # candidate rows: same x and y, time within 1
  idx <- which(row$x == to.merge$x &
               row$y == to.merge$y &
               abs(row$time - to.merge$time) <= 1)
  if (length(idx)) {
    # pick the candidate with the smallest absolute time difference
    j <- idx[which.min(abs(row$time - to.merge$time[idx]))]
    df$val[i] <- to.merge$val[j]
  }
}
I feel that I can somehow do a merge, like:
to.merge$closest_time_in_df <- sapply(to.merge$time,
    function (tm) {
        dts <- abs(tm - df$time)
        # difference must be at most 1
        if (min(dts) <= 1) {
            df$time[which.min(dts)]
        } else {
            NA
        }
    })
merge(df, to.merge,
      by.x=c('x', 'y', 'time'),
      by.y=c('x', 'y', 'closest_time_in_df'),
      all.x=T)
But this doesn't merge the (2, 1, 11) row, because to.merge$closest_time_in_df for (2, 1, 11.6, c) is 12, and a time of 12 in df corresponds to (x, y) = (2, 5), not (2, 1), hence the merge fails.
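The failure described above can be reproduced in isolation. The following is a minimal sketch, assuming the df printed in the question (the times are hard-coded here rather than regenerated from the seed); the names tm_time, global_hit and group_hit are illustrative only.

```r
# df as printed in the question (times hard-coded instead of relying on set.seed)
df <- cbind(expand.grid(x = 1:3, y = 1:5),
            time = c(8, 11, 17, 27, 6, 27, 28, 20, 19, 2, 6, 5, 21, 12, 23))

tm_time <- 11.6  # time of the (2, 1, 11.6, c) row in to.merge

# Nearest df$time over ALL rows, ignoring x and y (what the sapply() does)
global_hit <- df[which.min(abs(tm_time - df$time)), ]

# Nearest df$time among the rows with matching x and y only
same_xy <- df[df$x == 2 & df$y == 1, ]
group_hit <- same_xy[which.min(abs(tm_time - same_xy$time)), ]

global_hit  # the (2, 5) row with time 12
group_hit   # the (2, 1) row with time 11
```

The nearest time has to be searched within each (x, y) group, which is what the answers do in different ways.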
Use data.table and roll='nearest' or, to limit to 1, roll = 1, rollends = c(TRUE,TRUE), e.g.
library(data.table)
# create data.tables with the same key columns (x, y, time)
DT <- data.table(df, key = names(df))
tm <- data.table(to.merge, key = key(DT))
# use join syntax with roll = 'nearest'
tm[DT, roll='nearest']
# x y time val
# 1: 1 1 8 NA
# 2: 1 2 27 NA
# 3: 1 3 28 NA
# 4: 1 4 2 NA
# 5: 1 5 21 NA
# 6: 2 1 11 c
# 7: 2 2 6 NA
# 8: 2 3 20 NA
# 9: 2 4 6 e
# 10: 2 5 12 NA
# 11: 3 1 17 NA
# 12: 3 2 27 NA
# 13: 3 3 19 NA
# 14: 3 4 5 NA
# 15: 3 5 23 d
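In the output above, row 9 received 'e' even though roll='nearest' puts no limit on the distance. A quick base R check of the matched rows, as a sketch: res holds only the three rows that received a value (copied from the output above), and the names res and chk are illustrative.

```r
# Rows of the roll = 'nearest' result that received a value
res <- data.frame(x = c(2, 2, 3), y = c(1, 4, 5), time = c(11, 6, 23),
                  val = c("c", "e", "d"), stringsAsFactors = FALSE)
to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4),
                       time = c(17, 12, 11.6, 22.5, 2),
                       val = letters[1:5], stringsAsFactors = FALSE)

# Look up the original to.merge time of each matched value
chk <- merge(res, to.merge, by = c("x", "y", "val"), suffixes = c("", ".tm"))
chk$dist <- abs(chk$time - chk$time.tm)

# Matches that violate the "at most 1 apart" requirement
chk[chk$dist > 1, ]
```

Only the 'e' match (time 6 in df against time 2 in to.merge) is further than 1 apart, so it is the one that has to be dropped.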
You can limit yourself to looking forward and back by at most 1 by setting roll=-1 and rollends = c(TRUE,TRUE)
new <- tm[DT, roll=-1, rollends =c(TRUE,TRUE)]
new
x y time val
1: 1 1 8 NA
2: 1 2 27 NA
3: 1 3 28 NA
4: 1 4 2 NA
5: 1 5 21 NA
6: 2 1 11 c
7: 2 2 6 NA
8: 2 3 20 NA
9: 2 4 6 NA
10: 2 5 12 NA
11: 3 1 17 NA
12: 3 2 27 NA
13: 3 3 19 NA
14: 3 4 5 NA
15: 3 5 23 d
Or you can roll=1 first, then roll=-1, then combine the results (tidying up the val.1 column from the second rolling join)
new <- tm[DT, roll = 1][tm[DT,roll=-1]][is.na(val), val := ifelse(is.na(val.1),val,val.1)][,val.1 := NULL]
new
x y time val
1: 1 1 8 NA
2: 1 2 27 NA
3: 1 3 28 NA
4: 1 4 2 NA
5: 1 5 21 NA
6: 2 1 11 c
7: 2 2 6 NA
8: 2 3 20 NA
9: 2 4 6 NA
10: 2 5 12 NA
11: 3 1 17 NA
12: 3 2 27 NA
13: 3 3 19 NA
14: 3 4 5 NA
15: 3 5 23 d
Using merge a couple of times and aggregate once, here is how to do it.
set.seed(1)
df <- cbind(expand.grid(x = 1:3, y = 1:5), time = round(runif(15) * 30))
to.merge <- data.frame(x = c(2, 2, 2, 3, 2), y = c(1, 1, 1, 5, 4), time = c(17, 12, 11.6, 22.5, 2), val = letters[1:5], stringsAsFactors = F)
#Find rows that match by x and y
res <- merge(to.merge, df, by = c("x", "y"), all.x = TRUE)
res$dif <- abs(res$time.x - res$time.y)
res
## x y time.x val time.y dif
## 1 2 1 17.0 a 11 6.0
## 2 2 1 12.0 b 11 1.0
## 3 2 1 11.6 c 11 0.6
## 4 2 4 2.0 e 6 4.0
## 5 3 5 22.5 d 23 0.5
#Find rows that need to be merged
res1 <- merge(aggregate(dif ~ x + y, data = res, FUN = min), res)
res1
## x y dif time.x val time.y
## 1 2 1 0.6 11.6 c 11
## 2 2 4 4.0 2.0 e 6
## 3 3 5 0.5 22.5 d 23
#Finally merge the result back into df
final <- merge(df, res1[res1$dif <= 1, c("x", "y", "val")], all.x = TRUE)
final
## x y time val
## 1 1 1 8 <NA>
## 2 1 2 27 <NA>
## 3 1 3 28 <NA>
## 4 1 4 2 <NA>
## 5 1 5 21 <NA>
## 6 2 1 11 c
## 7 2 2 6 <NA>
## 8 2 3 20 <NA>
## 9 2 4 6 <NA>
## 10 2 5 12 <NA>
## 11 3 1 17 <NA>
## 12 3 2 27 <NA>
## 13 3 3 19 <NA>
## 14 3 4 5 <NA>
## 15 3 5 23 d
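The aggregate step above takes the minimum per (x, y), which is enough here because df has exactly one row per (x, y) pair. If df could contain several rows with the same (x, y) but different times, one possible variant is to pick the closest candidate per df row instead. This is only a sketch: df2 and to.merge2 are made-up toy data, and ties in distance are not handled.

```r
# Toy data (hypothetical): two df rows share (x, y) = (2, 1)
df2 <- data.frame(x = c(2, 2), y = c(1, 1), time = c(11, 17))
to.merge2 <- data.frame(x = c(2, 2), y = c(1, 1), time = c(11.6, 16.5),
                        val = c("c", "a"), stringsAsFactors = FALSE)

# Pair every candidate with every df row that shares (x, y)
cand <- merge(df2, to.merge2, by = c("x", "y"), suffixes = c("", ".tm"))
cand$dif <- abs(cand$time - cand$time.tm)
cand <- cand[cand$dif <= 1, ]

# Keep the closest candidate per df row (x, y, time), not per (x, y)
best <- merge(aggregate(dif ~ x + y + time, data = cand, FUN = min), cand)

# Merge the chosen values back into df2
final2 <- merge(df2, best[, c("x", "y", "time", "val")], all.x = TRUE)
final2
```

With this toy data the row with time 11 gets 'c' and the row with time 17 gets 'a'.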
mnel's answer uses roll = "nearest" in a data.table join but does not limit to +/- 1 as requested by the OP. In addition, MichaelChirico has suggested using the on parameter.
This approach uses roll = "nearest"; setDT() to coerce a data.frame to data.table without copying (introduced 2014-02-27 with v.1.9.2 of data.table); and the on parameter, which spares setting a key explicitly (introduced 2015-09-19 with v.1.9.6).
So, the code below
library(data.table) # version 1.11.4 used
setDT(df)[setDT(to.merge), on = .(x, y, time), roll = "nearest",
val := replace(val, abs(x.time - i.time) > 1, NA)]
df
has updated df:
    x y time  val
 1: 1 1    8 <NA>
 2: 2 1   11    c
 3: 3 1   17 <NA>
 4: 1 2   27 <NA>
 5: 2 2    6 <NA>
 6: 3 2   27 <NA>
 7: 1 3   28 <NA>
 8: 2 3   20 <NA>
 9: 3 3   19 <NA>
10: 1 4    2 <NA>
11: 2 4    6 <NA>
12: 3 4    5 <NA>
13: 1 5   21 <NA>
14: 2 5   12 <NA>
15: 3 5   23    d
Note that the order of rows has not been changed (in contrast to Chinmay Patil's answer).
In case df must not be changed, a new data.table can be created by
result <- setDT(to.merge)[setDT(df), on = .(x, y, time), roll = "nearest",
.(x, y, time, val = replace(val, abs(x.time - i.time) > 1, NA))]
result
which returns the same result as above.
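In both variants, the +/- 1 limit comes from the replace() call, which blanks out values whose time distance exceeds 1. A small base R illustration of that masking step on its own, using made-up vectors val and dist that stand in for the joined columns:

```r
val  <- c("c", "e", "d")   # values found by the nearest join
dist <- c(0.6, 4, 0.5)     # stands in for abs(x.time - i.time)

# Values further than 1 away are replaced by NA, the rest are kept
masked <- replace(val, dist > 1, NA)
masked
```

Here only the middle value is masked, which mirrors how the 'e' match at distance 4 is dropped.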