Efficient calculation of Euclidean distance

Original post: 2017-03-19 03:29:42. Tags: python, algorithm, python-3.x, euclidean-distance

I have an M x N array, where M is the number of observations and N is the dimensionality of each vector. From this array of vectors, I need to calculate the mean and minimum Euclidean distance between the vectors.

In my mind, this requires me to calculate C(M, 2) = M(M-1)/2 distances, which is an O(n^min(k, n-k)) algorithm; with k = 2 that is O(M^2). My M is ~10,000 and my N is ~1,000, and this computation takes ~45 seconds.
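For reference, the exact all-pairs computation can be vectorised with the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b. The following is only a minimal NumPy sketch; the function name `pairwise_mean_min` is made up for illustration, and it materialises the full M x M matrix, so for M ~ 10,000 it would probably need to be processed in row blocks.

```python
import numpy as np

def pairwise_mean_min(X):
    """Exact mean and minimum Euclidean distance over all C(M, 2) pairs."""
    sq = np.einsum("ij,ij->i", X, X)                  # squared norms, shape (M,)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)  # squared distances, (M, M)
    np.maximum(d2, 0.0, out=d2)                       # clip tiny negatives from round-off
    iu = np.triu_indices(len(X), k=1)                 # each unordered pair once
    d = np.sqrt(d2[iu])
    return d.mean(), d.min()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))   # small stand-in for the 10,000 x 1,000 array
mean_d, min_d = pairwise_mean_min(X)
```

The subtraction in the identity can lose precision for points that are very close together, so the minimum is worth re-checking directly on the original vectors.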

Is there a more efficient way to compute the mean and min distances? Perhaps a probabilistic method? I don't need it to be exact, just close.
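One probabilistic option for the mean is plain Monte Carlo: draw random pairs and average their distances. This is a sketch under my own assumptions (the name `sampled_mean_distance` and the default of 10,000 pairs are arbitrary), and it only addresses the mean; the smallest sampled distance can only overestimate the true minimum.

```python
import numpy as np

def sampled_mean_distance(X, n_pairs=10_000, seed=0):
    """Estimate the mean pairwise distance from uniformly sampled pairs."""
    rng = np.random.default_rng(seed)
    m = len(X)
    i = rng.integers(0, m, size=n_pairs)
    j = rng.integers(0, m - 1, size=n_pairs)
    j = j + (j >= i)   # skip index i, so j is uniform over the other M - 1 points
    d = np.linalg.norm(X[i] - X[j], axis=1)
    stderr = d.std(ddof=1) / np.sqrt(n_pairs)   # rough standard error of the estimate
    return d.mean(), stderr

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))
estimate, stderr = sampled_mean_distance(X, n_pairs=2_000)
```

The cost is O(n_pairs * N) instead of O(M^2 * N), and the standard error gives a rough idea of how many pairs are enough.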

3 answers

You didn't describe where your vectors come from, nor what use you will put the mean and minimum to. Here are some observations about the general case. Limited ranges, error tolerance, and discrete values may admit a more efficient approach.

The mean distance between M points sounds quadratic, O(M^2). But M / N is 10, fairly small, and N is huge, so the data probably resembles a hairy sphere in 1e3-space. Computing the centroid of the M points, and then the M distances to the centroid, might turn out to be useful in your problem domain; it is hard to tell.
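The centroid idea costs only O(M * N). As a sketch (the function name `centroid_summary` is made up here): the sum of squared pairwise distances equals M times the sum of squared distances to the centroid, so the root-mean-square pairwise distance comes out exactly, and by Jensen's inequality it is an upper bound on the mean pairwise distance. Whether that bound is close enough depends on the data.

```python
import numpy as np

def centroid_summary(X):
    """Distances to the centroid, plus the exact RMS pairwise distance."""
    m = len(X)
    c = X.mean(axis=0)                    # centroid, shape (N,)
    r = np.linalg.norm(X - c, axis=1)     # M distances to the centroid
    # sum over pairs of ||xi - xj||^2 = M * sum(r^2); divide by C(M, 2) pairs
    rms_pairwise = np.sqrt(2.0 * np.sum(r ** 2) / (m - 1))
    return r, rms_pairwise

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))
r, rms_pairwise = centroid_summary(X)
```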

The minimum distance among M points is more interesting. Choose a small number of pairs at random, say 100, compute their distances, and take half the minimum as an estimate of the global minimum distance. (Validate by comparing to the next few smallest distances, if desired.) Now use a spatial UB-tree to model each point as a positive integer. This involves finding the N per-dimension minima of the M x N values, adding constants so each minimum becomes zero, scaling so the estimated global min distance corresponds to at least 1.0, and then truncating to integers.
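The sampling and rescaling steps described above could look roughly like this. It is a sketch, not a tuned implementation: the name `discretize` is hypothetical, and the scale is chosen so that the estimated minimum distance maps to exactly 1.0.

```python
import numpy as np

def discretize(X, n_pairs=100, seed=0):
    """Shift, scale and truncate X to non-negative integers."""
    rng = np.random.default_rng(seed)
    m = len(X)
    i = rng.integers(0, m, size=n_pairs)
    j = rng.integers(0, m - 1, size=n_pairs)
    j = j + (j >= i)                          # make sure j != i
    sample_min = np.linalg.norm(X[i] - X[j], axis=1).min()
    est = 0.5 * sample_min                    # half the sampled minimum
    if est == 0.0:
        raise ValueError("sampled a duplicate point; cannot choose a scale")
    shifted = X - X.min(axis=0)               # every column now starts at zero
    Q = np.floor(shifted / est).astype(np.int64)
    return Q, est

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
Q, est = discretize(X)
```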

With these transformed vectors in hand, we're ready to turn them into a UB-tree representation that we can sort, and then do nearest-neighbor spatial queries on the sorted values. For each point compute an integer. Shift the low-order bit of each dimension's value into the result, then iterate. Continue iterating over all dimensions until the non-zero bits have all been consumed and appear in the result, then proceed to the next point. Numerically sort the integer result values, yielding a data structure similar to a PostGIS index.
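Here is one reading of that bit-interleaving step, written as a standard Morton (Z-order) key in which bit b of dimension d lands at position b * N + d. The helper names `interleave_bits` and `morton_order` are made up for this sketch, and it relies on Python's arbitrary-precision integers, since N = 1,000 dimensions will not fit in a machine word.

```python
def interleave_bits(coords):
    """Morton key: bit 0 of every dimension, then bit 1 of every dimension, ..."""
    coords = [int(c) for c in coords]
    if any(c < 0 for c in coords):
        raise ValueError("coordinates must be non-negative integers")
    key, out_bit, bit = 0, 0, 0
    while any(c >> bit for c in coords):      # stop once all non-zero bits are used
        for c in coords:
            key |= ((c >> bit) & 1) << out_bit
            out_bit += 1
        bit += 1
    return key

def morton_order(Q):
    """Row indices of Q sorted by Morton key."""
    return sorted(range(len(Q)), key=lambda r: interleave_bits(Q[r]))

Q = [[2, 0], [0, 0], [1, 0]]
order = morton_order(Q)
```

Rows that are adjacent in this order are candidate near neighbours, although Z-order curves do have jumps, so it is worth checking a window of several neighbours in the sorted list and not only the immediate one.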

Now you have a discretized representation that supports reasonably efficient queries for nearest neighbors (though admittedly N=1e3 is inconveniently large). After finding two or more coarse-grained nearby neighbors, you can query the original vector representation to obtain high-resolution distances between them, for finer discrimination. If your data distribution turns out to have a large fraction of points that discretize to being off by a single bit from their nearest neighbor, e.g. locations of oxygen atoms where each has a buddy, then increase the global min distance estimate so the low-order bits offer adequate discrimination.

A similar discretization approach would be to appropriately scale, e.g., 2-dimensional inputs, mark an initially empty grid, and then scan immediate neighborhoods. This relies on the global min being within a "small" neighborhood, due to appropriate scaling. In your case you would be marking an N-dimensional grid.
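For the 2-dimensional case the grid idea can be sketched with a dictionary of cells. The function name `grid_min_distance` and the `cell` parameter are assumptions of this example. The result is exact whenever the true minimum distance is at most `cell`; if no two points share a neighborhood it returns infinity, which signals that the cell size was chosen too small.

```python
import math
from collections import defaultdict
from itertools import product

def grid_min_distance(points, cell):
    """Smallest distance among pairs lying in the same or adjacent grid cells."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(math.floor(x / cell), math.floor(y / cell))].append(idx)
    best = math.inf
    for (cx, cy), bucket in grid.items():
        for dx, dy in product((-1, 0, 1), repeat=2):   # the 3 x 3 neighborhood
            for i in bucket:
                for j in grid.get((cx + dx, cy + dy), ()):
                    if i < j:                          # visit each pair once
                        best = min(best, math.dist(points[i], points[j]))
    return best

points = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0), (10.0, 10.0)]
closest = grid_min_distance(points, cell=2.0)
```

In N dimensions each cell has 3^N neighbors, so a literal port of this scan is not practical for N ~ 1,000; that is where the sorted integer keys come in.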

You may be able to speed things up with some sort of space partitioning.

For the minimum distance calculation, you would only need to consider pairs of points in the same or neighbouring partitions. For an approximate mean, you might be able to come up with some sort of weighted average based on the distances between partitions and the number of points within them.
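A very crude version of the weighted-average idea, purely as an illustration: partition the points by their nearest of k randomly chosen centers, then average the centroid-to-centroid distances weighted by the number of cross-partition pairs. The partitioning scheme, the name `partition_mean_estimate`, and the default k are all my assumptions. Pairs inside the same partition are ignored, so the estimate is biased and should be checked against an exact or sampled value on a subset.

```python
import numpy as np

def partition_mean_estimate(X, k=8, seed=0):
    """Approximate mean pairwise distance from partition centroids and sizes."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    # nearest center via ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x.c (||x||^2 is constant per row)
    score = (centers ** 2).sum(axis=1)[None, :] - 2.0 * (X @ centers.T)
    labels = np.argmin(score, axis=1)
    counts = np.bincount(labels, minlength=k)
    used = np.flatnonzero(counts > 0)          # skip empty partitions
    cents = np.array([X[labels == a].mean(axis=0) for a in used])
    n = counts[used]
    a, b = np.triu_indices(len(used), k=1)     # all pairs of partitions
    weights = n[a] * n[b]                      # number of cross-partition point pairs
    dists = np.linalg.norm(cents[a] - cents[b], axis=1)
    return float((weights * dists).sum() / weights.sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 30))
approx_mean = partition_mean_estimate(X, k=8)
```

With k equal to M every point is its own partition and the result reduces to the exact mean, so k trades accuracy against the O(M * k * N) assignment cost.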

I had the same issue before, and it worked for me once I normalized the values. So try normalizing the data before calculating the distance.