
Merge data.table by two nearest variables

I have two data tables with x,y coordinates and some other information, which I would like to merge based on nearest-neighbour distance, i.e. each row i of the first table should be matched to the row j of the second table that minimises the Euclidean distance: d_i = min_j [(x_i-x_j)^2+(y_i-y_j)^2]^0.5. Say I have the following two sets:

library(data.table)
DT1=data.table(x=1:5,y=3:7)
DT2=data.table(x=c(2,4,2,3,6),y=c(2.5,3.1,2,3,5),Q=c('a','b','c','d','e'))

Then the desired result of the merge would be:

   x y Q
1: 1 3 a
2: 2 4 d
3: 3 5 d
4: 4 6 e
5: 5 7 e

I could of course write a loop over DT1 to calculate the nearest neighbour for each row in DT1 and then merge based on this calculation, but that seems to defeat the purpose of data tables. Moreover, it would be very slow for data tables of several million rows.

I know that for a single column I could do a nearest-neighbour merge like this:

DT2[DT1,roll="nearest"]

But that (logically) doesn't work when I define two keys (x and y) for the tables to be merged. Does a similar syntax exist for a two-parameter nearest-neighbour merge? If not, is there a smarter way to do this than just looping, as I mentioned?
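To make the limitation concrete, here is a small sketch using the two tables from above. It relies on the documented behaviour that `roll` applies only to the last join column, while the earlier join columns must match exactly. The variable name `res` is only for illustration.

```r
library(data.table)

DT1 <- data.table(x = 1:5, y = 3:7)
DT2 <- data.table(x = c(2, 4, 2, 3, 6), y = c(2.5, 3.1, 2, 3, 5),
                  Q = c("a", "b", "c", "d", "e"))

# Join on both columns; roll = "nearest" affects only y (the last join column),
# so x still has to match exactly.
res <- DT2[DT1, on = c("x", "y"), roll = "nearest"]

# Rows of DT1 whose x value does not occur in DT2 (x = 1 and x = 5) get Q = NA,
# even though a nearest neighbour in the plane exists for them.
print(res)
```

So the two-key rolling join is an exact match on x followed by a nearest match on y, which is not the same as a nearest neighbour in two dimensions.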

One possible solution:

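# Return the Q value of the DT2 row closest to the point (u, v)
func = function(u,v)
{
    # Squared Euclidean distance from (u, v) to every row of DT2
    vec = with(DT2, (u-x)^2 + (v-y)^2)
    DT2[which.min(vec),]$Q
}

# Apply func to each row of DT1 and attach the result as column Q
transform(DT1, Q=apply(DT1, 1, function(u) func(u[1], u[2])))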

#   x y Q
#1: 1 3 a
#2: 2 4 d
#3: 3 5 d
#4: 4 6 e
#5: 5 7 e
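The solution above still calls a function once per row of DT1. As a possible alternative, the sketch below computes all pairwise squared distances in one step with `outer()` and picks the closest DT2 row per DT1 row with `max.col()`. This is not the answer's method, just one vectorised variant; the names `d2`, `nearest` and `result` are made up for illustration. It builds an nrow(DT1) x nrow(DT2) matrix, so it only suits tables small enough for that matrix to fit in memory.

```r
library(data.table)

DT1 <- data.table(x = 1:5, y = 3:7)
DT2 <- data.table(x = c(2, 4, 2, 3, 6), y = c(2.5, 3.1, 2, 3, 5),
                  Q = c("a", "b", "c", "d", "e"))

# Matrix of squared Euclidean distances: one row per DT1 point,
# one column per DT2 point.
d2 <- outer(DT1$x, DT2$x, "-")^2 + outer(DT1$y, DT2$y, "-")^2

# For each DT1 row, the column index of the smallest distance.
# max.col() finds row-wise maxima, so the matrix is negated.
nearest <- max.col(-d2, ties.method = "first")

# Attach the matching Q values to a copy, leaving DT1 itself unchanged.
result <- copy(DT1)[, Q := DT2$Q[nearest]]
print(result)
```

For tables with several million rows, the full distance matrix is too large. A k-d tree search, for example `nn2()` from the RANN package if it is available, returns the nearest-neighbour indices without materialising all pairs.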
