The main goal is to compute customer similarity based on Euclidean distance and to find the 5 most similar customers for each customer.
I have data for 400,000 customers, each with 40 attributes. The DataFrame looks like:
        A1  A2  ... A40
0       xx  xx  ... xx
1       xx  xx  ... xx
2       xx  xx  ... xx
...     ...
399,999 xx  xx  ... xx
I first standardize the data with sklearn's StandardScaler, which gives the processed array X_data.
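For reference, the standardization step looks roughly like this (df is a hypothetical name for the DataFrame above):

from sklearn.preprocessing import StandardScaler

# df is the 400,000 x 40 customer DataFrame shown above (hypothetical name)
X_data = StandardScaler().fit_transform(df)  # numpy array of shape (400000, 40)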
So now we have 400,000 customers (points/vectors), each with 40 dimensions. So far so good.
I then use dis = numpy.linalg.norm(a - b) to calculate the distance between each pair of points. The shorter the distance, the more similar the customers are.
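As a toy check of that step (made-up values, 3 dimensions instead of 40):

import numpy

a = numpy.array([0.1, -0.5, 1.2])
b = numpy.array([0.3, -0.1, 0.9])
dis = numpy.linalg.norm(a - b)  # sqrt(0.2**2 + 0.4**2 + 0.3**2) ≈ 0.539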
My plan was to calculate the 5 most similar customers for each customer and then combine the results. I started with customer 0 as a trial, but it is already too slow for just that one customer. Even if I reduce the 40 dimensions to 2 by PCA from sklearn.decomposition, it is still too slow.
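The PCA attempt was along these lines (a sketch, assuming X_data from above):

from sklearn.decomposition import PCA

# Reduce the 40 standardized attributes to 2 components
X_2d = PCA(n_components=2).fit_transform(X_data)  # shape (400000, 2)

Here is my attempt for customer 0: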
import numpy
import pandas as pd

result = pd.DataFrame(columns=['index1', 'index2', 'distance'])
for i in range(len(X_data)):
    dis = numpy.linalg.norm(X_data[0] - X_data[i])
    result.loc[len(result)] = [0, i, dis]
result = result.sort_values(by=['distance'])
result = result[1:6]  # pick 5 rows starting from the second: the first is customer 0 itself, with distance 0
The result looks like this; it shows the 5 most similar customers of customer 0:
index1 index2 distance
0 0 206391 0.004
1 0 314234 0.006
2 0 89284 0.007
3 0 124826 0.012
4 0 234513 0.013
So to get the result for all 400,000 customers, I could just wrap this loop in another for loop. But the problem is that it is already far too slow when computing the 5 most similar customers for customer 0 alone, not to mention all the customers. What should I do to make it faster? Any ideas?
You should not be using loc. Pandas relies on vectorized operations. Suppose you want to calculate the very basic 1D Euclidean distance between a series and a single point, called target:
import pandas as pd
series = pd.Series(list(range(1000)))
target = 10
Using loc:
for i in range(len(series)):
    (series.loc[i] - target)**2
Versus how it's supposed to be done in pandas:
(series-target)**2
Using timeit:
print(timeit.timeit(loc_version, setup=setup, number=1000)) # 6.433
print(timeit.timeit(pandas_version, setup=setup, number=1000)) # 0.158
In this case, that's about 40x slower using loc. Note that I don't take the square root, because for real numbers x and y, if x^2 < y^2 then |x| < |y|, so squared distances sort in the same order as the distances themselves.
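A quick sanity check of that ordering claim:

import numpy as np

d = np.array([3.0, -1.5, 0.2, 2.0])
# Ranking by squared value matches ranking by absolute value
print(np.array_equal(np.argsort(d**2), np.argsort(np.abs(d))))  # True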
I hate when people don't include their timeit code:
setup = """
import pandas as pd
series = pd.Series(list(range(1000)))
target = 10
"""
loc_version = """
for i in range(len(series)):
    (series.loc[i] - target)**2
"""
pandas_version = """
(series-target)**2
"""
import timeit
print(timeit.timeit(loc_version, setup=setup, number=1000))
print(timeit.timeit(pandas_version, setup=setup, number=1000))
This shows that the two results are equal:
import pandas as pd
import numpy as np
series = pd.Series(list(range(1000)))
target = 10
result = []
for i in range(len(series)):
    result.append((series.loc[i] - target)**2)
print(np.allclose(result, (series - target)**2))
Use the efficient implementation from scikit-learn:
sklearn.metrics.pairwise_distances(X)
which returns a distance matrix D such that D[i, j] is the distance between the ith and jth vectors of the given matrix X. (Be aware that for 400,000 points this full matrix may not fit in memory.)
Then you can use np.argpartition(D, k) to access the indices of the k smallest distances in each row; see the sketch after the next snippet. Or simply, based on the scikit docs and @bb1's comment:
import numpy as np
from sklearn.neighbors import NearestNeighbors
samples = [[0, 0, 2], [1, 0, 0], [0, 0, 1]]
neigh = NearestNeighbors(n_neighbors=2, radius=0.4)
neigh.fit(samples)
neigh.kneighbors(samples, 2, return_distance=False)[:,1]
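And a sketch of the argpartition route mentioned above, on toy sizes (k would be 5 and X would be the question's X_data):

import numpy as np
from sklearn.metrics import pairwise_distances

X = np.array([[0, 0, 2], [1, 0, 0], [0, 0, 1]], dtype=float)
D = pairwise_distances(X)  # D[i, j] = distance between rows i and j

k = 1  # number of neighbours wanted (5 in the question)
# Take the k+1 smallest per row, because each point's closest match is itself
# (distance 0); note argpartition leaves the selected indices unsorted.
nearest = np.argpartition(D, k + 1, axis=1)[:, :k + 1]
print(nearest)  # each row contains i itself plus its k nearest neighbours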
Using numpy vectorized operations you can avoid both for loops. I will take a smaller example, which you can easily extrapolate. Suppose I have 3 data points (400,000 in your case), each 4-dimensional (40-dimensional in your case):
a = np.array([2,4,5,6])
b = np.array([3,5,6,7])
c = np.array([4,6,7,8])
d = np.vstack([a,b,c])
d.shape
(3,4)
Now calculate the outer difference of the 3 vectors a, b, c in d:

      a    b    c
a   a-a  a-b  a-c
b   b-a  b-b  b-c
c   c-a  c-b  c-c
Imagine each entry of this table extending into the 3rd dimension (perpendicular to the screen). The Euclidean distance between two vectors x and y is norm(x - y), so what we want is the norm of this matrix along axis=2. This matrix can be generated by broadcasting d against a reshaped version of d with shape (3, 1, 4):
v = d - d.reshape((3,1,4))
v
array([[[ 0, 0, 0, 0],
[ 1, 1, 1, 1],
[ 2, 2, 2, 2]],
[[-1, -1, -1, -1],
[ 0, 0, 0, 0],
[ 1, 1, 1, 1]],
[[-2, -2, -2, -2],
[-1, -1, -1, -1],
[ 0, 0, 0, 0]]])
Notice the rows of 0s in the 3 matrices. Now we find the norm of this matrix along axis=2:
np.linalg.norm(v,axis=2)
array([[0., 2., 4.],
[2., 0., 2.],
[4., 2., 0.]])
Now all we have to do is find the n smallest numbers along axis=1 (smallest, because a smaller distance means more similar). There are many methods to do that; please refer to this question, depending on whether you want just the 5 nearest distances or their indices as well.
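To round off the small example, one possible way to pull out the indices (a sketch; on real data np.argpartition would avoid the full sort):

import numpy as np

a = np.array([2, 4, 5, 6])
b = np.array([3, 5, 6, 7])
c = np.array([4, 6, 7, 8])
d = np.vstack([a, b, c])

# Pairwise distance matrix via the broadcasting derived above
dist = np.linalg.norm(d - d.reshape((3, 1, 4)), axis=2)

# Indices of the n nearest neighbours per row (n=1 here; 5 in the question);
# column 0 of the argsort is dropped because each point is nearest to itself.
n = 1
nearest = np.argsort(dist, axis=1)[:, 1:n + 1]
print(nearest)  # [[1] [0] [1]]: b is nearest to a, a is nearest to b, b is nearest to c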