Find the minimum distance between two data frames, for each element in the second data frame

I have two data frames, ev1 and ev2, holding timestamps of two types of events collected over many tests. Each data frame has a "test_id" column and a timestamp column (named "time" in the example below). For each ev2 event, I need to find the minimum distance to an ev1 event in the same test.

I have working code that merges the two datasets, calculates the distances, and then uses dplyr to filter for the minimum distance:

library(dplyr)

ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(6, 1, 8, 4, 5, 11))

# Pair every ev2 row with every ev1 row from the same test
data <- merge(ev2, ev1, by=c("test_id"), suffixes=c(".ev2", ".ev1"))

data$distance <- data$time.ev2 - data$time.ev1

# For each ev2 event, keep the pairing with the smallest absolute distance
min_data <- data %>%
  group_by(test_id, time.ev2) %>%
  filter(abs(distance) == min(abs(distance)))

While this works, the merge step is very slow and feels inefficient: I'm generating a huge table with every ev2->ev1 combination for the same test_id, only to filter it down to one row per ev2 event. It seems like there should be a way to "filter on the fly" during the merge. Is there?

Update: The following case, with two "group by" columns, fails when the data.table approach outlined by akrun is used:

ev1 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(1, 2, 3, 2, 3, 4), group_id=c(0, 0, 0, 1, 1, 1))
ev2 = data.frame(test_id = c(0, 0, 0, 1, 1, 1), time=c(5, 6, 7, 1, 2, 8), group_id=c(0, 0, 0, 1, 1, 1))
setkey(setDT(ev1), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=abs(time-i.time)]

Error in eval(expr, envir, enclos) : object 'i.time' not found

Here's how I'd do it using data.table:

require(data.table)
setkey(setDT(ev1), test_id)
ev1[ev2, .(ev2.time = i.time, ev1.time = time[which.min(abs(i.time - time))]), by = .EACHI]
#    test_id ev2.time ev1.time
# 1:       0        6        3
# 2:       0        1        1
# 3:       0        8        3
# 4:       1        4        4
# 5:       1        5        4
# 6:       1       11        4

In joins of the form x[i] in data.table, the prefix i. is used to refer to the columns in i when both x and i share the same name for a particular column.
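As a small illustration of the i. prefix, here is a sketch on toy tables. The names x, i, res and gap are made up for this example, and it uses the on= argument rather than setkey(), which assumes a data.table version that supports on=.

```r
library(data.table)

# Toy tables (hypothetical): both have a column called `time`.
x <- data.table(test_id = c(0, 0, 1), time = c(1, 3, 4))
i <- data.table(test_id = c(0, 1),    time = c(6, 5))

# Join x to i on test_id, evaluating j once per row of i (by = .EACHI).
# Inside j, `time` is x's column (the rows of x that match the current
# row of i), while `i.time` is the value from i.
res <- x[i,
         .(ev2.time = i.time, gap = min(abs(i.time - time))),
         on = "test_id", by = .EACHI]
print(res)
```

Each row of res corresponds to one row of i, so gap is the smallest absolute difference between that row's time and the matching times in x.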

Please see this SO post for an explanation of how this works.

This makes it syntactically easier to see what's going on, and it is memory efficient (at the expense of a little speed [1]) because it doesn't materialise the entire join result at all. In fact, this does exactly what you ask for in your post: filter on the fly, while merging.

[1] On speed: in most cases it doesn't really matter. If there are a lot of rows in i, it might be a tad slower, as the j-expression has to be evaluated for each row in i. In contrast, @akrun's answer does a cartesian join followed by a single filtering step. So while it's high on memory, it doesn't evaluate j for each row in i. But again, this shouldn't matter unless you work with a really large i, which is not often the case.
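To make the memory side of that trade-off concrete, the sketch below compares row counts for the question's first example. The variable names full and each are hypothetical, and on= is used in place of setkey(), which assumes a data.table version that supports it.

```r
library(data.table)

ev1 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(1, 2, 3, 2, 3, 4))
ev2 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(6, 1, 8, 4, 5, 11))

# Cartesian-style join: every ev2 row is paired with all ev1 rows
# of the same test, and the whole result is materialised.
full <- ev1[ev2, on = "test_id", allow.cartesian = TRUE]

# by = .EACHI: j is evaluated per ev2 row, so only one row per
# ev2 event is ever kept.
each <- ev1[ev2,
            .(ev2.time = i.time,
              ev1.time = time[which.min(abs(i.time - time))]),
            on = "test_id", by = .EACHI]

print(nrow(full))  # 18 (6 ev2 rows x 3 matching ev1 rows each)
print(nrow(each))  # 6
```

With three ev1 events per test the difference is small, but the intermediate table grows with the product of the per-test row counts, while the .EACHI result stays at one row per ev2 event.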

HTH

Maybe this helps:

library(data.table)
setkey(setDT(ev1), test_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=i.time-time]
DT[DT[,abs(distance)==min(abs(distance)), by=list(test_id, i.time)]$V1]
#    test_id time i.time distance
#1:       0    3      6        3
#2:       0    1      1        0
#3:       0    3      8        5
#4:       1    4      4        0
#5:       1    4      5        1
#6:       1    4     11        7

Or

ev1[ev2, allow.cartesian=TRUE][,distance:= i.time-time][,
      .SD[abs(distance)==min(abs(distance))], by=list(test_id, i.time)]

Update

Using the new grouping

setkey(setDT(ev1), test_id, group_id)
setkey(setDT(ev2), test_id, group_id)
DT <- ev1[ev2, allow.cartesian=TRUE][,distance:=i.time-time]
DT[DT[,abs(distance)==min(abs(distance)), by=list(test_id, 
                                group_id,i.time)]$V1]$distance
#[1]  2  3  4 -1  0  4

Based on the code you provided

min_data$distance
#[1]  2  3  4 -1  0  4
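As a side note, the two-column case can also be written without setting keys on either table by naming the join columns explicitly. This is only a sketch: it assumes a data.table version that supports the on= argument, it combines on= with the by = .EACHI pattern from the other answer, and the name nearest is made up for the example.

```r
library(data.table)

ev1 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(1, 2, 3, 2, 3, 4),
                  group_id = c(0, 0, 0, 1, 1, 1))
ev2 <- data.table(test_id = c(0, 0, 0, 1, 1, 1), time = c(5, 6, 7, 1, 2, 8),
                  group_id = c(0, 0, 0, 1, 1, 1))

# Join on both grouping columns by name, so the column order in ev2
# does not matter; `time` is not a join column, so i.time is available in j.
nearest <- ev1[ev2,
               .(ev2.time = i.time,
                 ev1.time = time[which.min(abs(i.time - time))]),
               on = .(test_id, group_id), by = .EACHI]

# Signed distance, using the question's convention (ev2 minus ev1)
nearest[, distance := ev2.time - ev1.time]
print(nearest$distance)
# [1]  2  3  4 -1  0  4
```

Naming the join columns avoids depending on how keys are set or on the order of columns in ev2.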
