Efficiently updating distances between points

Question

I have a data set that has n rows (observations) and p columns (features):

import numpy as np
from scipy.spatial.distance import pdist, squareform
p = 3
n = 5
xOld = np.random.rand(n * p).reshape([n, p])

I am interested in getting the distances between these points as an n x n matrix, which really has only n(n-1)/2 unique values:

sq_dists = pdist(xOld, 'sqeuclidean')
D_n = squareform(sq_dists)
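
As a quick illustration of the n(n-1)/2 remark, the sketch below inspects what pdist and squareform return for a small input. The seeded generator and the variable names `condensed` and `D` are assumptions made for this example only.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
n, p = 5, 3
x = rng.random((n, p))

condensed = pdist(x, 'sqeuclidean')   # 1-D array, one entry per unordered pair
D = squareform(condensed)             # full n x n matrix built from those entries

print(condensed.shape)                # (10,), i.e. n * (n - 1) / 2
print(D.shape)                        # (5, 5)
print(np.allclose(D, D.T))            # True: the matrix is symmetric
print(np.allclose(np.diag(D), 0.0))   # True: self-distances are zero
```

The condensed vector is the compact form; the square matrix stores each of its values twice, plus a zero diagonal.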

Now imagine I get N additional observations and would like to update D_n. One very inefficient way is:

N = 3
xNew = np.random.rand(N * p).reshape([N, p])
sq_dists = pdist(np.vstack([xOld, xNew]), 'sqeuclidean')  # np.row_stack is a deprecated alias of np.vstack
D_n_N = squareform(sq_dists)

However, with n ~ 10000 and N ~ 100, this recomputes the n x n block that I already have. My goal is to get D_n_N more efficiently by reusing D_n. To do that, I divide D_n_N into the blocks shown below. I already have D_n and can calculate B [N x N]. However, I am wondering if there is a good way to calculate A (or its transpose) without a bunch of for loops, and then construct D_n_N.

D_n [n x n]    A [n x N]
A.T [N x n]    B [N x N]

Thanks in advance.

Answer 1

Pretty interesting problem! I got to learn a few new things on the way to a solution.

Steps involved:

First off, we are introducing new points here. So, we need to use cdist to get the squared Euclidean distances between the old and new points. These go into two blocks of the new output: one right below the old distances and one to the right of them.
We also need to compute the pdist among the new points and put its squareformed block in the trailing part of the new diagonal region.

Schematically, the new D_n_N would look like this:

[   D_n      cdist.T
  cdist      New pdist squareformed]

Summing up, the implementation would look something along these lines:

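from scipy.spatial.distance import cdist

# Distances between every new point and every old point, shape (N, n1)
cdists = cdist(xNew, xOld, 'sqeuclidean')

n1 = D_n.shape[0]
out = np.empty((n1 + N, n1 + N))
out[:n1, :n1] = D_n          # old-vs-old block, reused as-is
out[n1:, :n1] = cdists       # new-vs-old block
out[:n1, n1:] = cdists.T     # old-vs-new block (transpose)
out[n1:, n1:] = squareform(pdist(xNew, 'sqeuclidean'))  # new-vs-new block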
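
The same block assembly can be wrapped so that several batches of points are appended one after another. The following is a minimal sketch, not part of the original answer: the helper name `append_points`, the batch sizes, and the seed are hypothetical choices for illustration. It checks the incrementally built matrix against a full recomputation.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

def append_points(D_old, x_old, x_new, metric='sqeuclidean'):
    """Hypothetical helper: return (D_all, x_all) after adding the rows of x_new."""
    n_old, n_new = x_old.shape[0], x_new.shape[0]
    cross = cdist(x_new, x_old, metric)            # new-vs-old, shape (n_new, n_old)
    D_all = np.empty((n_old + n_new, n_old + n_new))
    D_all[:n_old, :n_old] = D_old                  # reuse the existing block
    D_all[n_old:, :n_old] = cross
    D_all[:n_old, n_old:] = cross.T
    D_all[n_old:, n_old:] = squareform(pdist(x_new, metric))
    return D_all, np.vstack([x_old, x_new])

rng = np.random.default_rng(0)                     # seed chosen only for reproducibility
x = rng.random((6, 3))
D = squareform(pdist(x, 'sqeuclidean'))

# Append three batches in sequence, updating D each time.
for batch_size in (2, 4, 3):
    D, x = append_points(D, x, rng.random((batch_size, 3)))

D_full = squareform(pdist(x, 'sqeuclidean'))       # reference: recompute from scratch
print(D.shape, np.allclose(D, D_full))
```

Returning the stacked points alongside the matrix keeps the two in sync, since the next update needs both.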

Runtime test

Approaches -

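# Original approach: recompute every pairwise distance from scratch.
# D_n is unused here; xOld is read from the enclosing (global) scope.
def org_app(D_n, xNew):
    sq_dists = pdist(np.vstack([xOld, xNew]), 'sqeuclidean')
    D_n_N = squareform(sq_dists)
    return D_n_N

# Proposed approach: reuse D_n and compute only the new blocks.
# xOld is read from the enclosing (global) scope.
def proposed_app(D_n, xNew, N):
    cdists = cdist(xNew, xOld, 'sqeuclidean')  # new-vs-old block
    n1 = D_n.shape[0]
    out = np.empty((n1 + N, n1 + N))
    out[:n1, :n1] = D_n
    out[n1:, :n1] = cdists
    out[:n1, n1:] = cdists.T
    out[n1:, n1:] = squareform(pdist(xNew, 'sqeuclidean'))
    return out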

Timings -

In [102]: # Setup inputs
     ...: p = 3
     ...: n = 5000
     ...: xOld = np.random.rand(n * p).reshape([n, p])
     ...: 
     ...: sq_dists = pdist(xOld, 'sqeuclidean')
     ...: D_n = squareform(sq_dists)
     ...: 
     ...: N = 3000
     ...: xNew = np.random.rand(N * p).reshape([N, p])
     ...: 

In [103]: np.allclose( proposed_app(D_n, xNew, N), org_app(D_n, xNew))
Out[103]: True

In [104]: %timeit org_app(D_n, xNew)
1 loops, best of 3: 541 ms per loop

In [105]: %timeit proposed_app(D_n, xNew, N)
1 loops, best of 3: 201 ms per loop
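
To get a rough sense of how the saving scales with n and N, one can count the point pairs each approach evaluates. This is only a proxy, under the assumption that distance evaluations dominate; copying the n x n block into the output also takes time. The function name `pair_counts` is made up for this sketch.

```python
def pair_counts(n, N):
    """Number of point pairs evaluated by each strategy."""
    full = (n + N) * (n + N - 1) // 2        # pdist on all n + N points
    incremental = n * N + N * (N - 1) // 2   # cdist block plus pdist on the new points
    return full, incremental

# Sizes from the timing run above, then the sizes mentioned in the question.
for n, N in [(5000, 3000), (10000, 100)]:
    full, inc = pair_counts(n, N)
    print(n, N, full, inc, round(full / inc, 1))
# 5000 3000 31996000 19498500 1.6
# 10000 100 50999950 1004950 50.7
```

By this count the benefit is much larger when N is small relative to n, as in the question, than in the timing setup above where N is comparable to n.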

Answer 2

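Just use cdist. Pass 'sqeuclidean' so the blocks match the squared distances in D_n (cdist defaults to plain Euclidean distance):

import numpy as np
from scipy.spatial.distance import cdist

D_OO = cdist(xOld, xOld, 'sqeuclidean')  # same values as the existing D_n, which can be reused instead

D_NN = cdist(xNew, xNew, 'sqeuclidean')
D_NO = cdist(xNew, xOld, 'sqeuclidean')
D_ON = cdist(xOld, xNew, 'sqeuclidean')  # or D_NO.T

And finally:

D_ = np.vstack((np.hstack((D_OO, D_ON)), np.hstack((D_NO, D_NN))))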
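
A possible variant of this stacking, sketched here under the assumption that D_n already holds squared Euclidean distances, reuses D_n for the old-vs-old block and lets np.block do the assembly. The small sizes and the seed are arbitrary.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

rng = np.random.default_rng(1)                   # seed chosen only for reproducibility
xOld = rng.random((5, 3))
xNew = rng.random((3, 3))
D_n = squareform(pdist(xOld, 'sqeuclidean'))     # assumed to be available already

D_NO = cdist(xNew, xOld, 'sqeuclidean')          # new-vs-old block
D_NN = cdist(xNew, xNew, 'sqeuclidean')          # new-vs-new block
D_n_N = np.block([[D_n,  D_NO.T],
                  [D_NO, D_NN]])

# Compare with recomputing every pair from scratch.
reference = squareform(pdist(np.vstack([xOld, xNew]), 'sqeuclidean'))
print(D_n_N.shape, np.allclose(D_n_N, reference))
```

np.block copies each block into a new array, so memory use is the same as with the nested hstack/vstack calls; the gain is readability and skipping the recomputation of the old block.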

Answer 1
Score 2, accepted, 2017-03-14 18:54:04

Answer 2
Score 1, 2017-03-14 19:05:17