Calculating Euclidean distance with a lot of pairs of points is too slow in Python
The main goal is to compute customer similarity based on Euclidean distance, and to find the 5 most similar customers for each customer.
I have data for 400,000 customers, each with 40 attributes. The DataFrame looks like:
A1 A2 ... A40
0 xx xx ... xx
1 xx xx ... xx
2 xx xx ... xx
... ...
399,999 xx xx ... xx
I first standardize the data with sklearn's StandardScaler, which gives the processed data X_data.
So now we have 400,000 customers (points/vectors), each with 40 dimensions. So far so good.
I then use dis = numpy.linalg.norm(a-b) to calculate the distance between each pair of points. The shorter the distance, the more similar the customers are.
My plan was to compute the 5 most similar customers for each customer and then combine the results. I started with customer0 as a trial, but it is already too slow for just one customer. Even when I reduce the 40 dimensions to 2 with PCA from sklearn.decomposition, it is still too slow.
import numpy
import pandas as pd
result = pd.DataFrame(columns=['index1', 'index2', 'distance'])
for i in range(len(X_data)):
    dis = numpy.linalg.norm(X_data[0] - X_data[i])
    result.loc[len(result)] = [0, i, dis]
result = result.sort_values(by=['distance'])
result = result[1:6]  # pick the first 5 customers starting from the second row, because the first row is customer0 himself with distance 0
The result looks like this; it shows the 5 customers most similar to customer0:
index1 index2 distance
0 0 206391 0.004
1 0 314234 0.006
2 0 89284 0.007
3 0 124826 0.012
4 0 234513 0.013
So to get the result for all 400,000 customers, I could just wrap this for loop in another for loop. But the problem is that it is already very slow when I compute the 5 most similar customers for customer0 alone, let alone for all customers. What should I do to make it faster? Any ideas?
You should not be using loc. Pandas relies on vectorized operations. Suppose you want to calculate the very basic 1D Euclidean distance between a series and a single point, called target:
import pandas as pd
series = pd.Series(list(range(1000)))
target = 10
Using loc:
for i in range(len(series)):
    (series.loc[i] - target)**2
Versus how it's supposed to be done in pandas:
(series-target)**2
Using timeit:
print(timeit.timeit(loc_version, setup=setup, number=1000)) # 6.433
print(timeit.timeit(pandas_version, setup=setup, number=1000)) # 0.158
In this case, the loc version is about 40x slower. Note that I don't take the square root, because for real numbers x and y, if x^2 < y^2 then |x| < |y|, so comparing squared distances gives the same ordering.
I hate it when people don't include their timeit code, so here is mine:
setup = """
import pandas as pd
series = pd.Series(list(range(1000)))
target = 10
"""
loc_version = """
for i in range(len(series)):
    (series.loc[i] - target)**2
"""
pandas_version = """
(series-target)**2
"""
import timeit
print(timeit.timeit(loc_version, setup=setup, number=1000))
print(timeit.timeit(pandas_version, setup=setup, number=1000))
This shows that the two versions give equal results:
import pandas as pd
import numpy as np
series = pd.Series(list(range(1000)))
target = 10
result = []
for i in range(len(series)):
    result.append((series.loc[i] - target)**2)
print(np.allclose(result, (series-target)**2))  # True
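Applied to the question's setting, the same vectorized idea might look like the sketch below. It is only an illustration: the random array stands in for the real X_data (assumed to be a 2-D float array with one row per customer), and nearest_to is a hypothetical helper name, not something from pandas or NumPy. Squared distances are used for the ranking, and the square root is taken only for the 5 values that are reported.

```python
import numpy as np

# Synthetic stand-in for the standardized data (assumption: the real X_data
# is a 2-D float array of shape (n_customers, n_attributes)).
rng = np.random.default_rng(0)
X_data = rng.standard_normal((10_000, 40))

def nearest_to(X, idx, k=5):
    """Return indices and distances of the k rows of X closest to row idx."""
    diff = X - X[idx]                                 # broadcast (n, d) - (d,)
    sq_dist = np.einsum("ij,ij->i", diff, diff)       # squared distance per row
    sq_dist[idx] = np.inf                             # exclude the customer itself
    nearest = np.argpartition(sq_dist, k)[:k]         # k smallest, in no particular order
    nearest = nearest[np.argsort(sq_dist[nearest])]   # order those k by distance
    return nearest, np.sqrt(sq_dist[nearest])

neighbors, distances = nearest_to(X_data, 0)
print(neighbors, distances)
```

This removes the inner Python loop for a single customer; repeating it for every customer still costs one pass over the whole array per customer.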
Use scikit-learn's efficient implementation:
sklearn.metrics.pairwise_distances(X)
which returns a distance matrix D such that D_{i, j} is the distance between the ith and jth vectors of the given matrix X.
Then you can use np.argpartition(D, k); the first k entries of each row of its result are the indices of the k smallest distances in that row. Or simply, based on the scikit-learn docs and @bb1's comment:
import numpy as np
from sklearn.neighbors import NearestNeighbors
samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]
neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
neigh.fit(samples)
neigh.kneighbors(samples, 2, return_distance=False)[:,1]
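For the question's use case (5 nearest customers for every customer), a sketch along the same lines could look like this. The random array is a small stand-in for the real X_data, which is an assumption here. Calling kneighbors() with no argument queries the fitted points themselves, and in that mode scikit-learn does not count a point as its own neighbor, so there is no zero-distance column to drop.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in for the standardized customer data (assumption).
rng = np.random.default_rng(0)
X_data = rng.standard_normal((2_000, 40))

nn = NearestNeighbors(n_neighbors=5).fit(X_data)

# No argument: neighbors of each fitted point, excluding the point itself.
distances, indices = nn.kneighbors()

# Row i holds the 5 customers closest to customer i, nearest first.
print(indices[0], distances[0])
```

Both arrays have shape (n_customers, 5), so they can be turned into a DataFrame directly if a tabular result like the one in the question is wanted.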
Using numpy vectorized operations you can avoid both for loops. I will use a smaller example, which you can easily extrapolate. Suppose I have 3 data points (400,000 in your case), each 4-dimensional (40-dimensional in your case).
import numpy as np
a = np.array([2,4,5,6])
b = np.array([3,5,6,7])
c = np.array([4,6,7,8])
d = np.vstack([a,b,c])
d.shape
(3, 4)
Now calculate the outer difference of the 3 vectors a, b, c in d:
      a     b     c
a    a-a   a-b   a-c
b    b-a   b-b   b-c
c    c-a   c-b   c-c
Imagine all of the vectors extending into the 3rd dimension (perpendicular to the screen). The Euclidean distance between 2 vectors x and y is norm(x-y). So what we want is the norm of this matrix along axis = 2.
This matrix can be generated by broadcasting d against a reshaped version of d with shape (3,1,4). (The broadcast yields each difference with the opposite sign, e.g. b-a in place of a-b, which does not change the norm.)
v = d - d.reshape((3,1,4))
v
array([[[ 0,  0,  0,  0],
        [ 1,  1,  1,  1],
        [ 2,  2,  2,  2]],

       [[-1, -1, -1, -1],
        [ 0,  0,  0,  0],
        [ 1,  1,  1,  1]],

       [[-2, -2, -2, -2],
        [-1, -1, -1, -1],
        [ 0,  0,  0,  0]]])
Notice the rows of 0s in the 3 matrices: each is the difference of a vector with itself. Now we want to find the norm of this matrix along axis = 2:
np.linalg.norm(v,axis=2)
array([[0., 2., 4.],
       [2., 0., 2.],
       [4., 2., 0.]])
Now all we have to do is find the n smallest numbers along axis = 1, skipping the 0 on the diagonal, which is each point's distance to itself. There are many methods to do that, for which please refer to this question, depending on whether you want just the 5 nearest values or their indices as well.
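One possible way to do that last step is sketched below. The nearest neighbors are the smallest distances in each row, so np.argpartition is used to pick them without sorting whole rows. The full difference array for 400,000 points of 40 dimensions would need tens of terabytes, so this sketch processes the rows in chunks; k_nearest and chunk_size are hypothetical names introduced for illustration, and the memory used per chunk grows as chunk_size × n × dimensions, so chunk_size would need tuning for real data.

```python
import numpy as np

# The toy data from the example above, as floats.
d = np.array([[2, 4, 5, 6],
              [3, 5, 6, 7],
              [4, 6, 7, 8]], dtype=float)

def k_nearest(points, k, chunk_size=16):
    """Indices of the k nearest other rows for every row, computed in chunks."""
    n = len(points)
    out = np.empty((n, k), dtype=np.intp)
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # (chunk, 1, dim) - (1, n, dim) broadcasts to (chunk, n, dim)
        diff = points[start:stop, None, :] - points[None, :, :]
        dist = np.linalg.norm(diff, axis=2)
        # Every point is at distance 0 from itself, so mask those entries.
        dist[np.arange(stop - start), np.arange(start, stop)] = np.inf
        # k smallest per row (unordered), then order those k by distance.
        part = np.argpartition(dist, k - 1, axis=1)[:, :k]
        order = np.argsort(np.take_along_axis(dist, part, axis=1), axis=1)
        out[start:stop] = np.take_along_axis(part, order, axis=1)
    return out

nearest = k_nearest(d, k=1)
print(nearest)
```

In the toy data, b is equally far from a and c, so either index may be reported for the middle row. For the real data set a tree-based or library nearest-neighbor search is likely to be faster than this brute-force approach.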