
Calculating Euclidean distance with a lot of pairs of points is too slow in Python

The main goal is to compute customer similarity based on Euclidean distance and to find the 5 most similar customers for each customer.

I have data for 400,000 customers, each with 40 attributes. The DataFrame looks like:

          A1 A2 ... A40
0         xx xx ... xx
1         xx xx ... xx
2         xx xx ... xx
...       ...
399,999   xx xx ... xx

I first standardize the data with sklearn's StandardScaler, which gives the processed data X_data.

So now we have 400,000 customers (points/vectors), each with 40 dimensions. So far so good.

I then use dis = numpy.linalg.norm(a - b) to calculate the distance between each pair of points a and b. The shorter the distance, the more similar the customers.

My plan was to calculate the 5 most similar customers for each customer and then combine the results. I started with customer 0 as a trial, but it is already too slow for just that one customer. Even when I reduce the 40 dimensions to 2 with PCA from sklearn.decomposition, it is still too slow.

import numpy
import pandas as pd

result = pd.DataFrame(columns=['index1', 'index2', 'distance'])
for i in range(len(X_data)):
    dis = numpy.linalg.norm(X_data[0] - X_data[i])
    result.loc[len(result)] = [0, i, dis]
result = result.sort_values(by=['distance'])
result = result[1:6]  # pick rows 1-5: row 0 is customer 0 himself, with distance 0

The result looks like this; it shows the 5 most similar customers of customer 0:

  index1 index2 distance
0   0    206391  0.004
1   0    314234  0.006
2   0    89284   0.007
3   0    124826  0.012
4   0    234513  0.013

So to get the result for all 400,000 customers, I could just wrap this loop in another for loop. But the problem is that it is already too slow even when computing the 5 most similar customers for customer 0 alone, let alone for all customers. What should I do to make it faster? Any ideas?

You should not be using loc. Pandas relies on vectorized operations. Suppose you want to calculate the very basic 1D Euclidean distance between a series and a single point, called target:

import pandas as pd
series = pd.Series(list(range(1000)))
target = 10

Using loc:

for i in range(len(series)):
    (series.loc[i] - target)**2

Versus how it's supposed to be done in pandas:

(series-target)**2

Using timeit:

print(timeit.timeit(loc_version, setup=setup, number=1000)) # 6.433
print(timeit.timeit(pandas_version, setup=setup, number=1000)) # 0.158

In this case, that's about 40x slower using loc. Note that I don't take the square root, because for real numbers x and y, if x^2 < y^2 then |x| < |y|, so the squared distances rank points exactly as the true distances do.
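A quick illustrative check of that claim: since distances are nonnegative, ranking by squared distance matches ranking by distance, so the square root can be skipped whenever only the ordering matters:

import numpy as np

d = np.array([3.0, 1.0, 2.0])
# Ordering the values and ordering their squares give the same ranking.
print(np.array_equal(np.argsort(d), np.argsort(d ** 2)))  # True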

Full timeit code

I hate when people don't include their timeit code:

setup = """
import pandas as pd
series = pd.Series(list(range(1000)))
target = 10
"""

loc_version = """
for i in range(len(series)):
    (series.loc[i] - target)**2
"""

pandas_version = """
(series-target)**2
"""

import timeit
print(timeit.timeit(loc_version, setup=setup, number=1000))
print(timeit.timeit(pandas_version, setup=setup, number=1000))

Result Equality

This shows that the two results are equal:

import pandas as pd
import numpy as np
series = pd.Series(list(range(1000)))
target = 10

result = []
for i in range(len(series)):
    result.append((series.loc[i] - target)**2)

print(np.allclose(result, (series - target)**2))
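Applied to the original problem, the same vectorized idea removes the inner loop entirely: subtract customer 0's row from the whole array at once and take row-wise norms. A minimal sketch (the random X_data here is a stand-in for the question's standardized 400,000 x 40 array):

import numpy as np

X_data = np.random.rand(1000, 40)  # stand-in for the real standardized data

# Distances from customer 0 to every customer in one vectorized call.
dist = np.linalg.norm(X_data - X_data[0], axis=1)

# argsort orders indices from nearest to farthest; position 0 is customer 0
# itself (distance 0), so keep positions 1 through 5.
nearest5 = np.argsort(dist)[1:6]
print(nearest5, dist[nearest5])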

Use scikit-learn's efficient implementation:

sklearn.metrics.pairwise_distances(X)

which returns a distance matrix D such that D[i, j] is the distance between the ith and jth vectors of the given matrix X.

Then you can use np.argpartition(D, k) to get the indices of the k smallest distances in each row (a sketch follows the snippet below). Or simply, based on the scikit docs and @bb1's comment:

import numpy as np
from sklearn.neighbors import NearestNeighbors

samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]
neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
neigh.fit(samples)
# Each point's nearest neighbor is itself (column 0), so keep column 1,
# the nearest distinct neighbor.
neigh.kneighbors(samples, 2, return_distance=False)[:, 1]
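A minimal sketch of the pairwise_distances plus argpartition route on a small stand-in array (note that the full 400,000 x 400,000 distance matrix would take over a terabyte in float64, so at that scale the NearestNeighbors approach above is the practical choice):

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.random.rand(100, 40)  # small stand-in for the standardized X_data
D = pairwise_distances(X)    # D[i, j] = Euclidean distance between rows i and j

k = 5
np.fill_diagonal(D, np.inf)  # exclude each point from its own neighbor list
# Indices of the k smallest distances in each row, in no particular order.
idx = np.argpartition(D, k, axis=1)[:, :k]
print(idx.shape)  # (100, 5)

For the question itself, the NearestNeighbors pattern with n_neighbors=6 (each point counts itself as its own nearest neighbor) and the first column dropped gives the 5 nearest customers per row.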

Using numpy vectorized operations you can avoid both for loops. I will use a smaller example, which you can easily extrapolate: suppose I have 3 data points (400,000 in your case), each 4-dimensional (40-dimensional in your case).

import numpy as np

a = np.array([2, 4, 5, 6])
b = np.array([3, 5, 6, 7])
c = np.array([4, 6, 7, 8])
d = np.vstack([a, b, c])
d.shape  # (3, 4)

Now calculate the outer difference of the 3 vectors a, b, c in d:

         a      b      c
    a   a-a    a-b    a-c
    b   b-a    b-b    b-c
    c   c-a    c-b    c-c

Imagine all of these vectors extending into the 3rd dimension (perpendicular to the screen). The Euclidean distance between two vectors x and y is norm(x - y). So what we want is the norm of this matrix along axis = 2.

This matrix can be generated by broadcasting d against a reshaped version of d with shape (3, 1, 4):

v = d - d.reshape((3, 1, 4))
v

array([[[ 0,  0,  0,  0],
        [ 1,  1,  1,  1],
        [ 2,  2,  2,  2]],

       [[-1, -1, -1, -1],
        [ 0,  0,  0,  0],
        [ 1,  1,  1,  1]],

       [[-2, -2, -2, -2],
        [-1, -1, -1, -1],
        [ 0,  0,  0,  0]]])

Notice the rows of 0s in the 3 matrices. Now we find the norm of this array along axis = 2:

np.linalg.norm(v, axis=2)

array([[0., 2., 4.],
       [2., 0., 2.],
       [4., 2., 0.]])

Now all we have to do is find the n smallest numbers along axis = 1. There are many methods to do that, for which please refer to this question, depending on whether you want the 5 nearest distances, their indices, or both.
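For example, a minimal sketch using np.argsort on the small matrix above (np.fill_diagonal excludes each point from its own neighbor list; np.argpartition would avoid the full per-row sort at 400,000 rows):

import numpy as np

a = np.array([2, 4, 5, 6])
b = np.array([3, 5, 6, 7])
c = np.array([4, 6, 7, 8])
d = np.vstack([a, b, c])

dist = np.linalg.norm(d - d.reshape((3, 1, 4)), axis=2)
np.fill_diagonal(dist, np.inf)             # a point is not its own neighbor
nearest = np.argsort(dist, axis=1)[:, :2]  # 2 nearest here; use [:, :5] for the question
print(nearest)

Note that at 400,000 points the intermediate (N, N, 40) difference array is far too large to materialize, so this broadcasting trick has to be applied in chunks of rows, or replaced by the NearestNeighbors approach from the other answer.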
