
Get nearest point to centroid, scikit-learn?

I am using K-means for a clustering problem. I am trying to find the data point that is closest to each centroid, which I believe is called the medoid.

Is there a way to do this in scikit-learn?

This is not the medoid, but here's something you can try:

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from sklearn.metrics import pairwise_distances_argmin_min
>>> X = np.random.randn(10, 4)
>>> km = KMeans(n_clusters=2).fit(X)
>>> closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
>>> closest
array([0, 8])

The array closest contains, for each centroid, the index of the point in X that is closest to it. So X[0] is the closest point in X to centroid 0, and X[8] is the closest to centroid 1.
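For contrast with the point nearest to the centroid, here is a minimal sketch of how an actual medoid could be computed, taking the medoid to be the cluster member with the smallest summed Euclidean distance to the other members. The fixed seed, the n_init value and the names members, D and medoids are assumptions made for illustration only; they are not part of the answer above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances, pairwise_distances_argmin_min

rng = np.random.RandomState(0)   # fixed seed, chosen only for reproducibility
X = rng.randn(10, 4)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Point nearest to each centroid (the approach shown above).
closest, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)

# Medoid of each cluster: the member with the smallest summed distance
# to all other members of the same cluster.
medoids = []
for k in range(km.n_clusters):
    members = np.flatnonzero(km.labels_ == k)   # indices into X
    D = pairwise_distances(X[members])          # Euclidean by default
    medoids.append(members[D.sum(axis=1).argmin()])
medoids = np.array(medoids)

print("nearest to centroid:", closest)
print("medoids:            ", medoids)
```

The two results often coincide, but they do not have to, which is why the answer notes that this is not the medoid.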

I tried the above answer, but it gave me duplicates in the result. The approach above finds the closest data point regardless of the clustering results, so it can return the same data point, or points from the same cluster, for more than one centroid.

If you want to find the closest data point within the cluster that each center represents, try this.

This solution returns one data point from each cluster, so the number of returned data points equals the number of clusters.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

# assume the total number of data is 100
all_data = [ i for i in range(100) ]
tf_matrix = np.random.random((100, 100))

# set your own number of clusters
num_clusters = 2

m_km = KMeans(n_clusters=num_clusters)  
m_km.fit(tf_matrix)
m_clusters = m_km.labels_.tolist()

centers = np.array(m_km.cluster_centers_)

closest_data = []
for i in range(num_clusters):
    center_vec = centers[i]
    data_idx_within_i_cluster = [ idx for idx, clu_num in enumerate(m_clusters) if clu_num == i ]

    one_cluster_tf_matrix = np.zeros( (  len(data_idx_within_i_cluster) , centers.shape[1] ) )
    for row_num, data_idx in enumerate(data_idx_within_i_cluster):
        one_row = tf_matrix[data_idx]
        one_cluster_tf_matrix[row_num] = one_row

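    # pairwise_distances_argmin_min expects 2-D inputs, so pass the center as a single row
    closest, _ = pairwise_distances_argmin_min(center_vec.reshape(1, -1), one_cluster_tf_matrix)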
    closest_idx_in_one_cluster_tf_matrix = closest[0]
    closest_data_row_num = data_idx_within_i_cluster[closest_idx_in_one_cluster_tf_matrix]
    data_id = all_data[closest_data_row_num]

    closest_data.append(data_id)

closest_data = list(set(closest_data))

assert len(closest_data) == num_clusters
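The same idea can be written more compactly by selecting each cluster's members through its labels. The following is only a sketch: closest_in_own_cluster is a hypothetical helper name, the seed and n_init are arbitrary, and it assumes no cluster is empty.

```python
import numpy as np
from sklearn.cluster import KMeans

def closest_in_own_cluster(X, km):
    """Hypothetical helper: for each cluster, return the index (into X)
    of the member that lies nearest to that cluster's centroid."""
    result = []
    for k, center in enumerate(km.cluster_centers_):
        members = np.flatnonzero(km.labels_ == k)            # rows of X in cluster k
        dists = np.linalg.norm(X[members] - center, axis=1)  # Euclidean distances
        result.append(members[dists.argmin()])
    return np.array(result)

rng = np.random.RandomState(0)    # fixed seed, for reproducibility only
X = rng.random_sample((100, 100))
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

idx = closest_in_own_cluster(X, km)
print(idx)
```

Because each index is drawn from a different cluster's member list, the returned indices are distinct by construction and no de-duplication step is needed.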

What you are trying to achieve is basically vector quantization, but in "reverse". SciPy has a highly optimized function for that, which was much faster in my test than the other methods mentioned. The output is the same as with pairwise_distances_argmin_min().

    from scipy.cluster.vq import vq

    # centroids: 2-D array of shape (n_centroids, n_features)
    # points:    2-D array of shape (n_points, n_features)

    closest, distances = vq(centroids, points)
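To make the call above concrete, here is a small self-contained sketch that runs vq on synthetic data and compares the result with pairwise_distances_argmin_min(). The array sizes and the seed are arbitrary assumptions, and the centroids are random rather than fitted, which is enough to compare the two functions.

```python
import numpy as np
from scipy.cluster.vq import vq
from sklearn.metrics import pairwise_distances_argmin_min

rng = np.random.RandomState(0)   # fixed seed, for reproducibility only
points = rng.randn(500, 8)       # (n_points, n_features)
centroids = rng.randn(5, 8)      # (n_centroids, n_features)

# Passing the centroids as "observations" and the data as the "code book"
# yields, for each centroid, the index of its nearest data point.
closest_vq, dist_vq = vq(centroids, points)
closest_sk, dist_sk = pairwise_distances_argmin_min(centroids, points)

print(np.array_equal(closest_vq, closest_sk))   # same indices?
print(np.allclose(dist_vq, dist_sk))            # same distances?
```

Both functions use Euclidean distance here, so they should agree except when two points are tied for nearest.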

The big difference shows up with very large arrays. I ran it with an array of 100000+ points and 65000+ centroids, and this method was about 4 times faster than pairwise_distances_argmin_min() from scikit-learn, as shown below:

     start_time = time.time()
     cl2, dst2 = vq(centroids, points)
     print("--- %s seconds ---" % (time.time() - start_time))
     --- 32.13545227050781 seconds ---

     start_time = time.time()
     cl2, dst2 = pairwise_distances_argmin_min(centroids, points)
     print("--- %s seconds ---" % (time.time() - start_time))
     --- 131.21064710617065 seconds ---


 