
Fastest way to find norm of difference of vectors in Python

I have a list of pairs (say 'A') and two arrays, 'B' and 'C' (each array has three columns). The arrays 'B' and 'C' are collections of 3-dimensional coordinates/vectors. 'A' is a list of index pairs: the first entry of each pair is a row index into B, and the second entry is a row index into C. As an example:

A = [(0, 0), (0, 1), (0, 3), (1, 2), (2, 2)]
B = [[0.1, 0.4, 0.5],
     [0.7, 0.0, 0.4],
     [0.8, 0.4, 0.7],
     [0.9, 0.3, 0.8]]
C = [[0.9, 0.8, 0.9],
     [0.3, 0.9, 0.5],
     [0.3, 0.4, 0.8],
     [0.5, 0.4, 0.3]]

For each pair in the list A, I wish to find the Euclidean norm of the difference of the corresponding vectors in B and C. That is, for the pair (i, j) I want Norm(B[i,:] - C[j,:]). I wish to do this for all pairs as fast as possible.

Presently, I am doing it in the following manner:

import numpy as np
map( lambda x: np.sqrt( (B[x[0]] - C[x[1]]).dot(B[x[0]] - C[x[1]]) ), A)

I find the above technique to be somewhat faster than:

map( lambda x: np.linalg.norm((B[x[0]] - C[x[1]])), A) 

I am dealing with arrays B and C that have millions of rows. Could anyone tell me the fastest way to obtain this functionality in Python?

Try the following (the time-measuring style requires Python >= 3.3):

import numpy as np
import time

A = np.array([[0, 0], [0, 1], [0, 3], [1, 2], [2, 2]])
B = np.array([[ 0.1,  0.4,  0.5],
    [ 0.7,  0.0,  0.4],
    [ 0.8,  0.4,  0.7],
    [ 0.9,  0.3,  0.8]])
C = np.array([[ 0.9,  0.8,  0.9],
    [ 0.3,  0.9,  0.5],
    [ 0.3,  0.4,  0.8],
    [ 0.5,  0.4,  0.3]])

# your approach A
start = time.perf_counter()
print(list(map( lambda x: np.sqrt( (B[x[0]] - C[x[1]]).dot(B[x[0]] - C[x[1]]) ), A)))  # wrap in list(): map is lazy in Python 3
print('used: ', time.perf_counter() - start)

# your approach B
start = time.perf_counter()
print(list(map( lambda x: np.linalg.norm((B[x[0]] - C[x[1]])), A)))  # wrap in list(): map is lazy in Python 3
print('used: ', time.perf_counter() - start)

# new approach
start = time.perf_counter()
print(np.linalg.norm(B[A[:,0]] - C[A[:,1]], axis=1))
print('used: ', time.perf_counter() - start)

Output:

[0.97979589711327131, 0.53851648071345037, 0.44721359549995798, 
0.69282032302755092, 0.50990195135927852]
used:  0.0014244442358304925

[0.97979589711327131, 0.53851648071345037, 0.44721359549995798, 
0.69282032302755092, 0.50990195135927852]
used:  0.0004404035049961194

[ 0.9797959   0.53851648  0.4472136   0.69282032  0.50990195]  # numpy's array printing truncates some digits; the values match
used:  0.0010253945897682102

These timings are obviously input-dependent and this is a crude benchmark, so measure it yourself on your own data!

The new approach is arguably the cleanest in terms of numpy style, although it does not have to be the fastest depending on your data. In general I would expect it to benefit the most from vectorized processing.
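If the norm call itself becomes a bottleneck, an equivalent formulation via np.einsum is worth trying (a sketch, not guaranteed to be faster on your data; it computes the row-wise squared sums directly, which can avoid an extra intermediate array):

```python
import numpy as np

A = np.array([[0, 0], [0, 1], [0, 3], [1, 2], [2, 2]])
B = np.random.randn(4, 3)
C = np.random.randn(4, 3)

# gather the paired rows and subtract in one vectorized step
D = B[A[:, 0]] - C[A[:, 1]]

# 'ij,ij->i' sums D[i,j]*D[i,j] over j, i.e. the squared norm of each row
norms = np.sqrt(np.einsum('ij,ij->i', D, D))

# agrees with the norm-based version
assert np.allclose(norms, np.linalg.norm(D, axis=1))
```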

For your bigger arrays, make sure numpy is linked against a fast, multithreaded BLAS distribution.
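You can check which BLAS/LAPACK build numpy was linked against with np.show_config() (the exact output format varies between numpy versions):

```python
import numpy as np

# prints the BLAS/LAPACK build configuration numpy was compiled with
# (e.g. OpenBLAS or MKL, include dirs, libraries)
np.show_config()
```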

Check this out:

import numpy as np
A = np.random.randint(1000, size=(10000,2))
B = np.random.randn(1000,3)
C = np.random.randn(1000,3)

def f():
    return np.linalg.norm(B[A[:,0]] - C[A[:,1]], axis=1)

%timeit f() # >>> 100000 loops, best of 3: 448 µs per loop

def g():
    # materialize the map: in Python 3 it is lazy, which would make the timing meaningless
    return np.array(list(map( lambda x: np.sqrt( (B[x[0]] - C[x[1]]).dot(B[x[0]] - C[x[1]]) ), A)))

%timeit g() # >>> 10 loops, best of 3: 31.8 ms per loop

assert np.abs(f()-g()).max() < 1e-10
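One caveat for the million-row case: the fancy-indexed arrays B[A[:,0]] and C[A[:,1]] are full copies, so memory may become the limit before speed does. A chunked variant (a sketch; the chunk size is an arbitrary parameter to tune) keeps the temporaries bounded while staying vectorized per chunk:

```python
import numpy as np

def pairwise_norms(A, B, C, chunk=100_000):
    """Euclidean norm of B[i] - C[j] for each pair (i, j) in A, computed in chunks."""
    out = np.empty(len(A))
    for s in range(0, len(A), chunk):
        idx = A[s:s + chunk]
        D = B[idx[:, 0]] - C[idx[:, 1]]          # temporaries of at most `chunk` rows
        out[s:s + chunk] = np.sqrt(np.einsum('ij,ij->i', D, D))
    return out

A = np.random.randint(100, size=(5000, 2))
B = np.random.randn(100, 3)
C = np.random.randn(100, 3)

# chunked result matches the all-at-once computation
assert np.allclose(pairwise_norms(A, B, C, chunk=512),
                   np.linalg.norm(B[A[:, 0]] - C[A[:, 1]], axis=1))
```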
