Efficient and precise calculation of the Euclidean distance

Following some online research (1, 2, numpy, scipy, scikit, math), I have found several ways of calculating the Euclidean distance in Python:

# 1
numpy.linalg.norm(a-b)

# 2
distance.euclidean(vector1, vector2)

# 3
sklearn.metrics.pairwise.euclidean_distances  

# 4
sqrt((xa-xb)**2 + (ya-yb)**2 + (za-zb)**2)

# 5
dist = [(a - b)**2 for a, b in zip(vector1, vector2)]
dist = math.sqrt(sum(dist))

# 6
math.hypot(x, y)

I was wondering if someone could provide some insight into which of the above (or any other that I have not found) is considered the best in terms of efficiency and precision. If someone is aware of any resources that discuss the subject, that would also be great.

The context I am interested in is calculating the Euclidean distance between pairs of number tuples, e.g. the distance between (52, 106, 35, 12) and (33, 153, 75, 10).

Conclusion first:

From the results of the timeit efficiency test, we can conclude the following regarding efficiency:

Method5 (zip, math.sqrt) > Method1 (numpy.linalg.norm) > Method2 (scipy.spatial.distance) > Method3 (sklearn.metrics.pairwise.euclidean_distances)

I didn't test your Method4, as it handles only a fixed number of dimensions and is otherwise equivalent to Method5.

For the rest, quite surprisingly, Method5 is the fastest. Method1, which uses numpy and is heavily optimized in C, is the second fastest, as expected.

For scipy.spatial.distance, if you go directly to the function definition, you will see that it actually uses numpy.linalg.norm, except that it validates the two input vectors before calling numpy.linalg.norm. That is why it is slightly slower than numpy.linalg.norm.

Finally, for sklearn, according to the documentation:

This formulation has two advantages over other ways of computing distances. First, it is computationally efficient when dealing with sparse data. Second, if one argument varies but the other remains unchanged, then dot(x, x) and/or dot(y, y) can be pre-computed. However, this is not the most precise way of doing this computation, and the distance matrix returned by this function may not be exactly symmetric as required

Since your question uses a fixed set of data, the advantages of this implementation do not show. And because of the trade-off between performance and precision, it also gives the worst precision of all the methods.
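To make the precision trade-off concrete, here is a minimal pure-Python sketch. It is not scikit-learn's actual code: it only assumes the expanded formulation that the documentation describes, sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)). The helper names direct_distance, dot and expanded_distance and the sample vectors are made up for illustration.

```python
import math

def direct_distance(x, y):
    """Subtract first, then square and sum (the approach of Method5)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def dot(x, y):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))

def expanded_distance(x, y):
    """Expanded form: sqrt(x.x - 2*x.y + y.y)."""
    squared = dot(x, x) - 2.0 * dot(x, y) + dot(y, y)
    # Rounding can push the result slightly below zero, so clamp it.
    return math.sqrt(max(squared, 0.0))

# Two large, nearly identical vectors; the true distance is about 0.001.
x = (1e8, 1e8, 1e8)
y = (1e8 + 1e-3, 1e8, 1e8)

direct = direct_distance(x, y)
expanded = expanded_distance(x, y)
print("direct:  ", direct)
print("expanded:", expanded)
```

The expanded form subtracts numbers of magnitude around 1e16 to recover a difference of about 1e-6, which is far below the spacing of doubles at that size, so almost all significant digits are lost. For small, well-separated integer tuples like the ones in the question, both forms give the same result.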

Regarding precision: Method5 = Method1 = Method2 > Method3

Efficiency Test Script:

import numpy as np
from scipy.spatial import distance
from sklearn.metrics.pairwise import euclidean_distances
import math

# 1
def eudis1(v1, v2):
    return np.linalg.norm(v1-v2)

# 2
def eudis2(v1, v2):
    return distance.euclidean(v1, v2)

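# 3
def eudis3(v1, v2):
    # euclidean_distances expects 2-D arrays of shape (n_samples, n_features);
    # recent scikit-learn versions reject 1-D input, so reshape to one row each
    return euclidean_distances(v1.reshape(1, -1), v2.reshape(1, -1))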

# 5
def eudis5(v1, v2):
    dist = [(a - b)**2 for a, b in zip(v1, v2)]
    dist = math.sqrt(sum(dist))
    return dist

dis1 = (52, 106, 35, 12)
dis2 = (33, 153, 75, 10)
v1, v2 = np.array(dis1), np.array(dis2)

import timeit

def wrapper(func, *args, **kwargs):
    def wrapped():
        return func(*args, **kwargs)
    return wrapped

wrappered1 = wrapper(eudis1, v1, v2)
wrappered2 = wrapper(eudis2, v1, v2)
wrappered3 = wrapper(eudis3, v1, v2)
wrappered5 = wrapper(eudis5, v1, v2)
t1 = timeit.repeat(wrappered1, repeat=3, number=100000)
t2 = timeit.repeat(wrappered2, repeat=3, number=100000)
t3 = timeit.repeat(wrappered3, repeat=3, number=100000)
t5 = timeit.repeat(wrappered5, repeat=3, number=100000)

print('\n')
print('t1: ', sum(t1)/len(t1))
print('t2: ', sum(t2)/len(t2))
print('t3: ', sum(t3)/len(t3))
print('t5: ', sum(t5)/len(t5))

Efficiency Test Output:

t1:  0.654838958307
t2:  1.53977598714
t3:  6.7898791732
t5:  0.422228400305

Precision Test Script & Result:

In [8]: eudis1(v1,v2)
Out[8]: 64.60650122085238

In [9]: eudis2(v1,v2)
Out[9]: 64.60650122085238

In [10]: eudis3(v1,v2)
Out[10]: array([[ 64.60650122]])

In [11]: eudis5(v1,v2)
Out[11]: 64.60650122085238

This does not exactly answer the question, but it is probably worth mentioning that if you aren't interested in the actual Euclidean distance and just want to compare Euclidean distances against each other, you can skip the square root. Square roots are monotone functions, i.e. x**(1/2) < y**(1/2) if and only if x < y (for non-negative x and y).

So if you don't need the explicit distance, but for instance just want to know which vector in a list, called vectorlist, is closest to vector1, you can avoid the square root (which is expensive in terms of both precision and time) and make do with something like

min(vectorlist, key=lambda compare: sum([(a - b)**2 for a, b in zip(vector1, compare)]))
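As a self-contained sketch of the same idea, the snippet below picks the nearest vector by comparing squared distances only. The helper name squared_distance and the contents of vectorlist are made up for illustration; vector1 and the first candidate are the tuples from the question.

```python
def squared_distance(u, v):
    """Sum of squared coordinate differences (no square root taken)."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

vector1 = (52, 106, 35, 12)
vectorlist = [
    (33, 153, 75, 10),
    (50, 100, 30, 10),
    (0, 0, 0, 0),
]

# The ordering by squared distance matches the ordering by distance,
# so min() finds the true nearest neighbour without calling sqrt.
nearest = min(vectorlist, key=lambda candidate: squared_distance(vector1, candidate))
print(nearest)  # -> (50, 100, 30, 10)
```

With integer inputs the squared distances are exact integers, so the comparison involves no floating-point rounding at all.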

As a general rule of thumb, stick to the scipy and numpy implementations where possible, as they're vectorized and much faster than native Python code. (The main reasons are that they are implemented in C and that vectorization eliminates the type-checking overhead that looping incurs.)
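Vectorization pays off mainly when many distances are computed in a single call rather than one pair at a time. The following is a rough sketch under that assumption; the arrays A and B are randomly generated sample data, not part of the original question.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
# 1000 pairs of 4-dimensional points (made-up sample data).
A = rng.integers(0, 256, size=(1000, 4)).astype(float)
B = rng.integers(0, 256, size=(1000, 4)).astype(float)

# One vectorized call: row-wise distances, result has shape (1000,).
batch = np.linalg.norm(A - B, axis=1)

# Equivalent pure-Python loop over the same rows, for comparison.
looped = [
    math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)))
    for row_a, row_b in zip(A, B)
]

print(np.allclose(batch, looped))
```

Both approaches can be timed with timeit on your own data; the relative speed depends on the number of pairs and on the machine.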

(Aside: my answer doesn't cover precision, but I think the same principle applies to precision as to efficiency.)

As a bit of a bonus, I'll chip in with some information on how you can profile your code to measure efficiency. If you're using the IPython interpreter, the secret is to use the %prun line magic.

In [1]: import numpy

In [2]: from scipy.spatial import distance

In [3]: c1 = numpy.array((52, 106, 35, 12))

In [4]: c2 = numpy.array((33, 153, 75, 10))

In [5]: %prun distance.euclidean(c1, c2)
         35 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 linalg.py:1976(norm)
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.dot}
        6    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        4    0.000    0.000    0.000    0.000 numeric.py:406(asarray)
        1    0.000    0.000    0.000    0.000 distance.py:232(euclidean)
        2    0.000    0.000    0.000    0.000 distance.py:152(_validate_vector)
        2    0.000    0.000    0.000    0.000 shape_base.py:9(atleast_1d)
        1    0.000    0.000    0.000    0.000 misc.py:11(norm)
        1    0.000    0.000    0.000    0.000 function_base.py:605(asarray_chkfinite)
        2    0.000    0.000    0.000    0.000 numeric.py:476(asanyarray)
        1    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 linalg.py:111(isComplexType)
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        2    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        4    0.000    0.000    0.000    0.000 {built-in method builtins.len}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 {method 'squeeze' of 'numpy.ndarray' objects}


In [6]: %prun numpy.linalg.norm(c1 - c2)
         10 function calls in 0.000 seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.exec}
        1    0.000    0.000    0.000    0.000 linalg.py:1976(norm)
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.dot}
        1    0.000    0.000    0.000    0.000 <string>:1(<module>)
        1    0.000    0.000    0.000    0.000 numeric.py:406(asarray)
        1    0.000    0.000    0.000    0.000 {method 'ravel' of 'numpy.ndarray' objects}
        1    0.000    0.000    0.000    0.000 linalg.py:111(isComplexType)
        1    0.000    0.000    0.000    0.000 {built-in method builtins.issubclass}
        1    0.000    0.000    0.000    0.000 {built-in method numpy.core.multiarray.array}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}

What %prun does is tell you how long a function call takes to run, including a bit of trace to help figure out where the bottleneck might be. In this case, both the scipy.spatial.distance.euclidean and numpy.linalg.norm implementations are pretty fast. Assuming you have defined a function dist(vect1, vect2), you can profile it using the same IPython magic call. As another added bonus, %prun also works inside the Jupyter notebook, and you can use %%prun to profile an entire cell of code, rather than just one function, simply by making %%prun the first line of that cell.
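Outside IPython, a similar report can be produced with the standard-library cProfile and pstats modules. This is a minimal sketch: the function dist is a stand-in for whatever implementation you want to measure, and the loop count of 10000 is arbitrary, chosen only so the timings are not all zero.

```python
import cProfile
import io
import math
import pstats

def dist(vect1, vect2):
    """Stand-in distance function to be profiled."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vect1, vect2)))

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10000):
    result = dist((52, 106, 35, 12), (33, 153, 75, 10))
profiler.disable()

# Collect the report into a string, sorted by internal time like %prun.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("tottime").print_stats(5)
report = stream.getvalue()
print(report)
```

Repeating the call many times inside the profiled region avoids the all-zero timings seen when a single fast call is profiled.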

I don't know how the precision and speed compare to the other libraries you mentioned, but you can do it for 2D vectors using the built-in math.hypot() function:

from math import hypot

def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = iter(iterable), iter(iterable)
    next(b, None)
    return zip(a, b)

a = (52, 106, 35, 12)
b = (33, 153, 75, 10)

dist = [hypot(p2[0]-p1[0], p2[1]-p1[1]) for p1, p2 in pairwise(tuple(zip(a, b)))]
print(dist)  # -> [131.59027319676787, 105.47511554864494, 68.94925670375281]
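Since Python 3.8, math.hypot accepts any number of coordinates, and math.dist computes the distance between two points directly, so the full four-dimensional distance from the question can also be computed with the standard library alone. A short sketch, assuming Python 3.8 or later:

```python
import math

a = (52, 106, 35, 12)
b = (33, 153, 75, 10)

# math.dist takes two points of equal dimension.
d_dist = math.dist(a, b)

# math.hypot takes the coordinate differences as separate arguments.
d_hypot = math.hypot(*(x - y for x, y in zip(a, b)))

print(d_dist, d_hypot)  # both approximately 64.6065
```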

Here is an example of how to use just numpy.

import numpy as np

a = np.array([3, 0])
b = np.array([0, 4])

c = np.sqrt(np.sum(((a - b) ** 2)))
# c == 5.0

Improving on the benchmark in the accepted answer, I found that, assuming your input is already in numpy array format, Method5 can be better written as:

import numpy as np
from numba import jit

@jit(nopython=True)
def euclidian_distance(y1, y2):
    return np.sqrt(np.sum((y1-y2)**2)) # based on pythagorean

Speed test:

euclidian_distance(y1, y2)
# 2.03 µs ± 138 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

np.linalg.norm(y1-y2)
# 17.6 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Fun fact: you can also apply jit to a function that calls numpy:

@jit(nopython=True)
def jit_linalg(y1, y2):
    return np.linalg.norm(y1-y2)

jit_linalg(y[i],y[j])
# 2.91 µs ± 261 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
