Distance matrix from two separate data frames

Question

I'd like to create a matrix that contains the Euclidean distances between the rows of one data frame and the rows of another. For example, say I have the following data frames:

a <- c(1,2,3,4,5)
b <- c(5,4,3,2,1)
c <- c(5,4,1,2,3)
df1 <- data.frame(a,b,c)

a2 <- c(2,7,1,2,3)
b2 <- c(7,6,5,4,3)
c2 <- c(1,2,3,4,5)
df2 <- data.frame(a2,b2,c2)

I would like to create a matrix with the distance from each row of df1 to each row of df2.

So matrix[2,1] should be the Euclidean distance between df1[2,] and df2[1,], matrix[3,2] the distance between df1[3,] and df2[2,], etc.

Does anyone know how this could be achieved?

Answer 1

Perhaps you could use the fields package: the function rdist might do what you want:

rdist: Euclidean distance matrix
Description: Given two sets of locations, computes the Euclidean distance matrix among all pairings.

> rdist(df1, df2)
         [,1]     [,2]     [,3]     [,4]     [,5]
[1,] 4.582576 6.782330 2.000000 1.732051 2.828427
[2,] 4.242641 5.744563 1.732051 0.000000 1.732051
[3,] 4.123106 5.099020 3.464102 3.316625 4.000000
[4,] 5.477226 5.000000 4.358899 3.464102 3.316625
[5,] 7.000000 5.477226 5.656854 4.358899 3.464102
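
If installing fields is not an option, the same matrix can be sketched in base R. This is a minimal illustration rather than a replacement for rdist: it loops over every pair of row indices, so it is only sensible for small inputs. The names m1, m2 and dist_mat are introduced here for the example.

```r
df1 <- data.frame(a = c(1,2,3,4,5), b = c(5,4,3,2,1), c = c(5,4,1,2,3))
df2 <- data.frame(a2 = c(2,7,1,2,3), b2 = c(7,6,5,4,3), c2 = c(1,2,3,4,5))

# Work on plain numeric matrices so row arithmetic ignores column names
m1 <- as.matrix(df1)
m2 <- as.matrix(df2)

# dist_mat[i, j] is the Euclidean distance between row i of df1 and row j of df2
dist_mat <- outer(
  seq_len(nrow(m1)), seq_len(nrow(m2)),
  Vectorize(function(i, j) sqrt(sum((m1[i, ] - m2[j, ])^2)))
)
print(dist_mat)
```

Here dist_mat[2, 1] is the distance between df1[2,] and df2[1,], as asked for in the question.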

The pdist package works similarly:

pdist: Distances between Observations for a Partitioned Matrix
Description: Computes the Euclidean distance between rows of a matrix X and rows of another matrix Y.

> pdist(df1, df2)
An object of class "pdist"
Slot "dist":
[1] 4.582576 6.782330 2.000000 1.732051 2.828427 4.242640 5.744563 1.732051
[9] 0.000000 1.732051 4.123106 5.099020 3.464102 3.316625 4.000000 5.477226
[17] 5.000000 4.358899 3.464102 3.316625 7.000000 5.477226 5.656854 4.358899
[25] 3.464102
attr(,"Csingle")
[1] TRUE

Slot "n":
[1] 5

Slot "p":
[1] 5

Slot ".S3Class":
[1] "pdist"
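
The pdist result above is a flat vector, not the matrix the question asks for. Judging from the printed output, the values are stored row by row (the first five match row 1 of the rdist result), so a sketch of the reshaping could look like the following. It relies only on the dist and n slots shown above; the package documentation also describes an as.matrix() method, which is likely the more idiomatic route. The code is skipped if pdist is not installed.

```r
if (requireNamespace("pdist", quietly = TRUE)) {
  df1 <- data.frame(a = c(1,2,3,4,5), b = c(5,4,3,2,1), c = c(5,4,1,2,3))
  df2 <- data.frame(a2 = c(2,7,1,2,3), b2 = c(7,6,5,4,3), c2 = c(1,2,3,4,5))

  res <- pdist::pdist(as.matrix(df1), as.matrix(df2))

  # Assumption from the printed output: res@dist is ordered row by row,
  # with res@n rows (one per row of df1)
  dist_mat <- matrix(as.numeric(res@dist), nrow = res@n, byrow = TRUE)
  print(dist_mat)
}
```

The printed slot carries a "Csingle" attribute, so values may differ from the rdist output in the last digits (4.242640 versus 4.242641 above).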


NOTE: If you instead want the Euclidean distances between the original vectors (a, b, c versus a2, b2, c2), which are the columns of the data frames, transpose both arguments:

> rdist(t(df1), t(df2))
         [,1]     [,2]     [,3]
[1,] 6.164414 7.745967 0.000000
[2,] 5.099020 4.472136 6.324555
[3,] 4.242641 5.291503 5.656854

Answer 2

This is adapted from my previous answer.

For general n-dimensional Euclidean distance, we can exploit the equation (not R, but algebra):

square_dist(b,a) = sum_i(b[i]*b[i]) + sum_i(a[i]*a[i]) - 2*inner_prod(b,a)

where the sums run over the dimensions i = 1, ..., n of the vectors a and b. Here, a and b are one pair of rows from df1 and df2, respectively. The key is that this equation can be written as a matrix equation for all pairs of rows in df1 and df2.
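
For reference, the matrix form can be written out as follows. The notation is introduced here and is not part of the original answer: A is the m-by-n matrix whose rows a_i come from df1, B is the p-by-n matrix whose rows b_j come from df2, the circle denotes the element-wise product (or element-wise square), and 1_k is a column vector of k ones.

```latex
% Pairwise form: squared distance between row a_i of A and row b_j of B
\[
  D_{ij}^2 = \lVert a_i - b_j \rVert^2
           = \sum_{k=1}^{n} a_{ik}^2 + \sum_{k=1}^{n} b_{jk}^2
             - 2 \sum_{k=1}^{n} a_{ik} b_{jk}
\]
% Matrix form: all m-by-p squared distances at once
\[
  D^{\circ 2} = r_A \mathbf{1}_p^{\top} + \mathbf{1}_m r_B^{\top} - 2 A B^{\top},
  \qquad r_A = (A \circ A)\,\mathbf{1}_n, \quad r_B = (B \circ B)\,\mathbf{1}_n
\]
```

The first two terms are the row sums of squares, repeated across columns and rows respectively, and the last term is the matrix of all inner products.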

In code:

d <- sqrt(matrix(rowSums(expand.grid(rowSums(df1*df1),rowSums(df2*df2))),
                 nrow=nrow(df1)) - 
          2. * as.matrix(df1) %*% t(as.matrix(df2)))

Notes:

The inner rowSums compute sum_i(a[i]*a[i]) and sum_i(b[i]*b[i]) for each row a in df1 and each row b in df2, respectively.
expand.grid then generates all pairs between df1 and df2.
The outer rowSums computes sum_i(a[i]*a[i]) + sum_i(b[i]*b[i]) for all these pairs.
This result is then reshaped into a matrix. Note that the number of rows of this matrix is the number of rows of df1.
Then subtract two times the inner product of all pairs. This inner product can be written as the matrix multiplication df1 %*% t(df2), where I left out the coercion to matrix for clarity.
Finally, take the square root.
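
The steps in these notes can be sketched one at a time. The intermediate names (m1, m2, sq1, sq2, pair_sums, cross) are introduced here for illustration; the result should match the one-liner above.

```r
df1 <- data.frame(a = c(1,2,3,4,5), b = c(5,4,3,2,1), c = c(5,4,1,2,3))
df2 <- data.frame(a2 = c(2,7,1,2,3), b2 = c(7,6,5,4,3), c2 = c(1,2,3,4,5))
m1 <- as.matrix(df1)
m2 <- as.matrix(df2)

# Inner rowSums: squared norm of every row
sq1 <- rowSums(m1 * m1)
sq2 <- rowSums(m2 * m2)

# expand.grid varies its first argument fastest, so filling a matrix
# column by column with nrow(df1) rows puts sq1[i] + sq2[j] at [i, j]
pair_sums <- matrix(rowSums(expand.grid(sq1, sq2)), nrow = nrow(m1))

# outer() builds the same matrix of pairwise sums directly
stopifnot(isTRUE(all.equal(pair_sums, outer(sq1, sq2, "+"))))

# Inner products of all row pairs, then the square root of the squared distances
cross <- m1 %*% t(m2)
d <- sqrt(pair_sums - 2 * cross)
print(d)
```

Using outer(sq1, sq2, "+") in place of the expand.grid step is an alternative that avoids building the intermediate data frame.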

Using this code with your data:

print(d)
##         [,1]     [,2]     [,3]     [,4]     [,5]
##[1,] 4.582576 6.782330 2.000000 1.732051 2.828427
##[2,] 4.242641 5.744563 1.732051 0.000000 1.732051
##[3,] 4.123106 5.099020 3.464102 3.316625 4.000000
##[4,] 5.477226 5.000000 4.358899 3.464102 3.316625
##[5,] 7.000000 5.477226 5.656854 4.358899 3.464102

Note that this code will work for any n > 1. In your case, n=3.
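
To illustrate that neither n nor the number of rows is fixed, here is a sketch that wraps the same formula in a function and applies it to n = 2 with data frames of different lengths. The name cross_dist is hypothetical and not from any package. One addition that is not in the code above: because the formula subtracts nearly equal numbers, rounding can produce tiny negative values where the true distance is zero, so the sketch clamps at zero before taking the square root.

```r
# cross_dist is a hypothetical helper name, not part of any package
cross_dist <- function(x, y) {
  x <- as.matrix(x)
  y <- as.matrix(y)
  stopifnot(ncol(x) == ncol(y))
  # Squared distances: |a|^2 + |b|^2 - 2 * (a . b) for every pair of rows
  sq <- outer(rowSums(x * x), rowSums(y * y), "+") - 2 * tcrossprod(x, y)
  # Clamp small negative rounding errors so sqrt() does not return NaN
  sqrt(pmax(sq, 0))
}

x <- data.frame(u = c(0, 3), v = c(0, 4))
y <- data.frame(u = c(0, 3, 6), v = c(0, 4, 8))
res <- cross_dist(x, y)   # 2 x 3 matrix: rows of x versus rows of y
print(res)
```

tcrossprod(x, y) computes x %*% t(y) without forming the transpose explicitly.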


Answer 1: score 6, accepted, 2016-09-18 18:32:33

Answer 2: score 2, 2016-09-18 18:35:24
