Efficiently updating distance between points
I have a data set that has n rows (observations) and p columns (features):
import numpy as np
from scipy.spatial.distance import pdist, squareform
p = 3
n = 5
xOld = np.random.rand(n * p).reshape([n, p])
I am interested in getting the distances between these points in an n x n matrix, which really has only n(n-1)/2 unique values:
sq_dists = pdist(xOld, 'sqeuclidean')
D_n = squareform(sq_dists)
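As a quick check on the n(n-1)/2 count, the sketch below compares the condensed vector returned by pdist with the full matrix produced by squareform. The variable names (`condensed`, `full`) and the fixed seed are illustrative choices, not part of the original question.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)  # fixed seed, only for reproducibility
n, p = 5, 3
x = rng.random((n, p))

condensed = pdist(x, 'sqeuclidean')  # one entry per unordered pair of rows
full = squareform(condensed)         # symmetric matrix with a zero diagonal

print(condensed.shape)  # (10,)  i.e. n*(n-1)/2 for n = 5
print(full.shape)       # (5, 5)
```

The full matrix stores every distance twice plus a zero diagonal, which is why only the condensed length counts as unique values.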
Now imagine I get N additional observations and would like to update D_n. One very inefficient way is:
N = 3
xNew = np.random.rand(N * p).reshape([N, p])
sq_dists = pdist(np.vstack([xOld, xNew]), 'sqeuclidean')
D_n_N = squareform(sq_dists)
However, considering that n ~ 10000 and N ~ 100, this is redundant, because most of the distances are already stored in D_n. My goal is to get D_n_N more efficiently using D_n. In order to do that, I am dividing D_n_N as follows:
D_n (n x n)    A [n x N]
A.T [N x n]    B [N x N]
I already have D_n and can calculate B [N x N]. However, I am wondering if there is a good way to calculate A (or A transpose) without a bunch of for loops, and finally construct D_n_N.
Thanks in advance.
Pretty interesting problem! I got to learn a few new things on the way to a solution.
Steps involved:
First off, we are introducing new points here. So, we need to use cdist to get the squared Euclidean distances between the old and new points. These go into two blocks of the new output, one right below the old distances and one to the right of them.
We also need to compute the pdist among the new points and put its square-formed block along the trailing part of the new diagonal region.
Schematically, the new D_n_N would look like this:
[ D_n      cdist.T
  cdist    New pdist squareformed ]
Summing up, the implementation would look something along these lines:
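from scipy.spatial.distance import cdist
# Squared distances between new and old points, shape (N, n)
cdists = cdist(xNew, xOld, 'sqeuclidean')
n1 = D_n.shape[0]
out = np.empty((n1 + N, n1 + N))
out[:n1, :n1] = D_n        # old block, reused as is
out[n1:, :n1] = cdists     # lower-left block
out[:n1, n1:] = cdists.T   # upper-right block
out[n1:, n1:] = squareform(pdist(xNew, 'sqeuclidean'))  # new-vs-new block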
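The same block layout can also be wrapped in a small helper. The sketch below is one possible variant, not the answer's own code: it uses np.block instead of filling a preallocated array, the function name `extend_sq_dists` is hypothetical, and it assumes `D_old` already holds the squared Euclidean distances for `x_old`.

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

def extend_sq_dists(D_old, x_old, x_new):
    """Return the (n+N) x (n+N) squared-distance matrix, reusing D_old.

    Hypothetical helper for illustration only.
    """
    cross = cdist(x_new, x_old, 'sqeuclidean')            # shape (N, n)
    new_block = squareform(pdist(x_new, 'sqeuclidean'))   # shape (N, N)
    return np.block([[D_old, cross.T],
                     [cross, new_block]])

# Small consistency check against the full recomputation
rng = np.random.default_rng(0)
x_old = rng.random((6, 3))
x_new = rng.random((2, 3))
D_old = squareform(pdist(x_old, 'sqeuclidean'))

D_all = extend_sq_dists(D_old, x_old, x_new)
reference = squareform(pdist(np.vstack([x_old, x_new]), 'sqeuclidean'))
print(np.allclose(D_all, reference))
```

np.block is more compact to read, while the preallocated np.empty version writes each block straight into the output; either way the old n x n block is copied once rather than recomputed.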
Runtime test
Approaches:
# Original approach (note: xOld is read from the enclosing scope)
def org_app(D_n, xNew):
    sq_dists = pdist(np.vstack([xOld, xNew]), 'sqeuclidean')
    D_n_N = squareform(sq_dists)
    return D_n_N

# Proposed approach (note: xOld is read from the enclosing scope)
def proposed_app(D_n, xNew, N):
    cdists = cdist(xNew, xOld, 'sqeuclidean')
    n1 = D_n.shape[0]
    out = np.empty((n1 + N, n1 + N))
    out[:n1, :n1] = D_n
    out[n1:, :n1] = cdists
    out[:n1, n1:] = cdists.T
    out[n1:, n1:] = squareform(pdist(xNew, 'sqeuclidean'))
    return out
Timings:
In [102]: # Setup inputs
...: p = 3
...: n = 5000
...: xOld = np.random.rand(n * p).reshape([n, p])
...:
...: sq_dists = pdist(xOld, 'sqeuclidean')
...: D_n = squareform(sq_dists)
...:
...: N = 3000
...: xNew = np.random.rand(N * p).reshape([N, p])
...:
In [103]: np.allclose( proposed_app(D_n, xNew, N), org_app(D_n, xNew))
Out[103]: True
In [104]: %timeit org_app(D_n, xNew)
1 loops, best of 3: 541 ms per loop
In [105]: %timeit proposed_app(D_n, xNew, N)
1 loops, best of 3: 201 ms per loop
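A rough way to see where the savings come from is to count how many point pairs each approach evaluates. This is a back-of-the-envelope sketch with hypothetical helper names; it ignores the cost of copying the existing n x n block, so it should not be read as a prediction of wall-clock speedup.

```python
def pairs_full(n, N):
    """Pairs evaluated when pdist is rerun on all n + N points."""
    m = n + N
    return m * (m - 1) // 2

def pairs_incremental(n, N):
    """Pairs evaluated when only the new blocks are computed:
    n*N old-vs-new pairs plus N*(N-1)/2 new-vs-new pairs."""
    return n * N + N * (N - 1) // 2

for n, N in [(5000, 3000), (10000, 100)]:
    full = pairs_full(n, N)
    inc = pairs_incremental(n, N)
    print(f"n={n}, N={N}: full={full}, incremental={inc}")
```

For the benchmark sizes above (n = 5000, N = 3000) the incremental route still evaluates well over half of the pairs, which fits the modest measured gain. For the sizes in the question (n ~ 10000, N ~ 100) it evaluates roughly 1 million pairs instead of roughly 51 million.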
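Just use cdist:
from scipy.spatial.distance import cdist
D_OO = cdist(xOld, xOld)  # default metric is 'euclidean'; pass 'sqeuclidean' to match D_n
D_NN = cdist(xNew, xNew)
D_NO = cdist(xNew, xOld)
D_ON = cdist(xOld, xNew)  # or D_NO.T
And finally:
D_ = np.vstack((np.hstack((D_OO, D_ON)), np.hstack((D_NO, D_NN))))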
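One detail worth checking when mixing this with the question's D_n: cdist uses the plain Euclidean metric unless told otherwise, whereas the question stores squared distances. The sketch below, with made-up array names and sizes, shows how the two relate.

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
a = rng.random((4, 3))
b = rng.random((2, 3))

d = cdist(a, b)                    # default metric: 'euclidean'
d2 = cdist(a, b, 'sqeuclidean')    # squared Euclidean, as in the question

print(np.allclose(d ** 2, d2))
```

If the existing D_n is to be reused instead of recomputing D_OO, the new blocks need to be computed with the same metric as D_n.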