Find the minimum distance between two data frames, for each element in the second data frame
I have two data frames, ev1 and ev2, describing timestamps of two types of events collected over many tests. Each data frame has the columns "test_id" and "timestamp" (named "time" in the code below). What I need to find is, for each ev2 event, the minimum distance to an ev1 event in the same test.
I have working code that merges the two datasets, calculates the distances, and then uses dplyr to filter for the minimum distance:
library(dplyr)
ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(6, 1, 8, 4, 5, 11))
# All ev2 x ev1 combinations within each test
data <- merge(ev2, ev1, by=c("test_id"), suffixes=c(".ev2", ".ev1"))
data$distance <- data$time.ev2 - data$time.ev1
# Keep, for each ev2 event, the row(s) with the smallest absolute distance
min_data <- data %>%
  group_by(test_id, time.ev2) %>%
  filter(abs(distance) == min(abs(distance)))
While this works, the merge step is very slow and feels inefficient: I'm generating a huge table with all ev2->ev1 combinations for the same test_id, only to filter it down to one row per ev2 event. It seems like there should be a way to "filter on the fly" during the merge. Is there?
Update: The following case with two "group by" columns fails when the data.table approach outlined by akrun is used:
ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4), group_id=c(0, 0, 0, 1, 1, 1))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(5, 6, 7, 1, 2, 8), group_id=c(0, 0, 0, 1, 1, 1))
setkey(setDT(ev1), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=abs(time-i.time)]
Error in eval(expr, envir, enclos) : object 'i.time' not found
Here's how I'd do it using data.table:
require(data.table)
setkey(setDT(ev1), test_id)
ev1[ev2, .(ev2.time = i.time, ev1.time = time[which.min(abs(i.time - time))]), by = .EACHI]
# test_id ev2.time ev1.time
# 1: 0 6 3
# 2: 0 1 1
# 3: 0 8 3
# 4: 1 4 4
# 5: 1 5 4
# 6: 1 11 4
In data.table joins of the form x[i], the prefix i. is used to refer to the columns in i when both x and i share the same name for a particular column. Please see this SO post for an explanation of how this works.
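To make the i. prefix concrete, here is a minimal sketch with two made-up tables. The names x, i and the column v are illustrative only and do not come from the question. Inside j, a bare column name refers to x, while the i. prefix picks the same-named column from i:

```r
library(data.table)

# Hypothetical toy tables; both have a non-join column named "v".
x <- data.table(id = c(1, 2), v = c(10, 20))
i <- data.table(id = c(2, 1), v = c(200, 100))

# Join on id; rows come back in the order of i.
# v is x's column, i.v is the same-named column from i.
res <- x[i, on = "id", .(x_v = v, i_v = i.v)]
print(res)
```

The on= argument is used here only to keep the sketch independent of any key set earlier; with a keyed x the prefix behaves the same way.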
This approach is syntactically more straightforward to follow, and it is memory efficient (at the expense of a little speed, see note 1) because it never materialises the entire join result. In fact, it does exactly what you ask for in your post: it filters on the fly, while merging.
Note 1: If i has a lot of rows, this might be a tad slower, as the j-expression has to be evaluated for each row in i. In contrast, @akrun's answer does a cartesian join followed by a single filtering step, so it does not evaluate j for each row in i. But again, this shouldn't matter unless you work with a really large i, which is not often the case. HTH
Maybe this helps:
library(data.table)
setkey(setDT(ev1), test_id)
# distance is ev2 time minus ev1 time, as in the question
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=i.time-time]
DT[DT[,abs(distance)==min(abs(distance)), by=list(test_id, i.time)]$V1]
# test_id time i.time distance
#1: 0 3 6 3
#2: 0 1 1 0
#3: 0 3 8 5
#4: 1 4 4 0
#5: 1 4 5 1
#6: 1 4 11 7
Or
ev1[ev2, allow.cartesian=TRUE][,distance:= i.time-time][,
  .SD[abs(distance)==min(abs(distance))], by=list(test_id, i.time)]
Using the new grouping:
setkey(setDT(ev1), test_id, group_id)
setkey(setDT(ev2), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=i.time-time]
DT[DT[,abs(distance)==min(abs(distance)), by=list(test_id,
group_id,i.time)]$V1]$distance
#[1] 2 3 4 -1 0 4
Based on the code you provided:
min_data$distance
#[1] 2 3 4 -1 0 4
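For the two-column case in the update, the by = .EACHI approach can also be extended by naming both join columns explicitly. The sketch below assumes a data.table version that supports the on= argument; the result names ev2.time and ev1.time follow the earlier answer. Naming the columns also sidesteps what is probably the cause of the 'i.time' error: when ev2 has no key, its leading columns (test_id, time) are matched against ev1's key (test_id, group_id), so ev2's time is consumed as a join column instead of surviving as i.time.

```r
library(data.table)

ev1 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(1, 2, 3, 2, 3, 4),
                  group_id = c(0, 0, 0, 1, 1, 1))
ev2 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(5, 6, 7, 1, 2, 8),
                  group_id = c(0, 0, 0, 1, 1, 1))

# For each row of ev2, pick the closest ev1 time within the same
# (test_id, group_id) pair, without materialising the full join.
res <- ev1[ev2, on = .(test_id, group_id),
           .(ev2.time = i.time,
             ev1.time = time[which.min(abs(i.time - time))]),
           by = .EACHI]
res[, distance := ev2.time - ev1.time]
print(res$distance)
```

On this example data the distances should agree with the values shown above for both the data.table and the dplyr versions.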