
How to use a precomputed distance matrix with the new version of KMeans in sklearn?

I am computing my own distance matrix as follows and I want to use it for clustering.

import numpy as np
from math import pi

#points containing time value in minutes
points = [100, 200, 600, 659, 700]

def convert_to_radian(x):
    return((x / (24 * 60)) * 2 * pi)

rad_function = np.vectorize(convert_to_radian)
points_rad = rad_function(points)

#generate distance matrix from each point
dist = points_rad[None,:] - points_rad[:, None]

#Assign shortest distances from each point
dist[((dist > pi) & (dist <= (2*pi)))] = dist[((dist > pi) & (dist <= (2*pi)))] -(2*pi)
dist[((dist > (-2*pi)) & (dist <= (-1*pi)))] = dist[((dist > (-2*pi)) & (dist <= (-1*pi)))] + (2*pi) 
dist = abs(dist)

#check dist
print(dist)

My distance matrix looks as follows.

[[0.         0.43633231 2.18166156 2.43909763 2.61799388]
 [0.43633231 0.         1.74532925 2.00276532 2.18166156]
 [2.18166156 1.74532925 0.         0.25743606 0.43633231]
 [2.43909763 2.00276532 0.25743606 0.         0.17889625]
 [2.61799388 2.18166156 0.43633231 0.17889625 0.        ]]
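
For reference, the same matrix can be computed a bit more compactly with the usual circular-distance formula min(|Δ|, 2π − |Δ|); a minimal sketch:

import numpy as np
from math import pi

#points containing time value in minutes
points = np.array([100, 200, 600, 659, 700])
points_rad = (points / (24 * 60)) * 2 * pi

#circular (wrap-around) distance: take the shorter of the two arcs between angles
delta = np.abs(points_rad[None, :] - points_rad[:, None])
dist = np.minimum(delta, 2 * pi - delta)
print(dist)  #should match the matrix above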

I want to have 2 clusters (e.g., cluster 1: 0,1 and cluster 2: 2,3,4) using k-means with the above precomputed distance matrix.

When I check the KMeans documentation, it seems that precomputed distances are deprecated: precompute_distances='deprecated'.

Link to documentation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html

I am wondering what other options I can look into to perform k-means using my precomputed distance matrix.

I am happy to provide more details if needed.

k-means needs the distances from each point to the centroids ("means") of the clusters at each iteration, not the pairwise distances between points. So, unlike e.g. k-nearest-neighbors, having the pairwise distances precomputed won't help*. The deprecated precompute_distances parameter instead controlled whether to compute all the point-center distances up front or inside the loop; for details see PR11950. That PR introduced a performance enhancement that made the parameter unnecessary.

* Well, there could perhaps be a speedup if the data were put into a search structure like a BallTree (again, see k-neighbors) so that not all the point-centroid distances needed to be computed; but it's not clear how much that would help, and it would probably only matter when k is quite large. At any rate, that's not what scikit-learn does here.
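
If the goal is simply to cluster from a precomputed distance matrix (rather than specifically k-means), some other scikit-learn estimators do accept one directly, e.g. AgglomerativeClustering or DBSCAN with metric="precomputed". A minimal sketch with the matrix from the question (note this is hierarchical clustering, not k-means, and on older scikit-learn versions the keyword is affinity= rather than metric=):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

#the precomputed circular distance matrix from the question
dist = np.array([
    [0.        , 0.43633231, 2.18166156, 2.43909763, 2.61799388],
    [0.43633231, 0.        , 1.74532925, 2.00276532, 2.18166156],
    [2.18166156, 1.74532925, 0.        , 0.25743606, 0.43633231],
    [2.43909763, 2.00276532, 0.25743606, 0.        , 0.17889625],
    [2.61799388, 2.18166156, 0.43633231, 0.17889625, 0.        ],
])

#'ward' linkage needs raw feature vectors, so use 'average' (or 'complete') with a precomputed matrix
agg = AgglomerativeClustering(n_clusters=2, metric="precomputed", linkage="average")
labels = agg.fit_predict(dist)
print(labels)  #groups points 0,1 together and 2,3,4 together (label numbering is arbitrary)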

Do you really need to use your own distance matrix for clustering if you're going to end up feeding the data to sklearn anyway? If not, you can run KMeans on your dataset directly by reshaping your points array to shape (-1, 1) (NumPy treats -1 as a placeholder, inferring that dimension from the length of the original array).

import numpy as np
from math import pi
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

#points containing time value in minutes
points = [100, 200, 600, 659, 700]

def convert_to_radian(x):
    return((x / (24 * 60)) * 2 * pi)

rad_function = np.vectorize(convert_to_radian)
points_rad = rad_function(points)

lbls = KMeans(n_clusters=2).fit_predict(points_rad.reshape((-1,1)))
print(lbls) # prints the following: [0 0 1 1 1]

fig, ax = plt.subplots()

ax.scatter(points_rad, points_rad, c=lbls)

plt.show()

[Figure: scatter plot of points_rad against itself, colored by cluster label]
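
One caveat with clustering the raw radian values: times of day wrap around midnight, so 23:59 and 00:01 end up far apart in Euclidean distance even though they are one minute apart. A common workaround is to map each angle onto the unit circle as (cos θ, sin θ) and run KMeans on those 2D coordinates; a minimal sketch using the same points as above:

import numpy as np
from math import pi
from sklearn.cluster import KMeans

points = np.array([100, 200, 600, 659, 700])
points_rad = (points / (24 * 60)) * 2 * pi

#embed each time-of-day angle on the unit circle so times just before and
#just after midnight land close together in Euclidean space
xy = np.column_stack((np.cos(points_rad), np.sin(points_rad)))

lbls = KMeans(n_clusters=2, n_init=10).fit_predict(xy)
print(lbls)  #e.g. [0 0 1 1 1] (cluster numbering is arbitrary)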
