
Finding the indices of all points corresponding to a particular centroid using kmeans clustering

Here is a simple implementation of kmeans clustering (with the points in each cluster labelled from 1 to 500):

from pylab import plot, show
import matplotlib.pyplot as plt
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq

# data generation: two overlapping groups of 150 points each
data = vstack((rand(150, 2) + array([.5, .5]), rand(150, 2)))

# computing K-Means with K = 2 (2 clusters)
centroids, _ = kmeans(data, 2)
# assign each sample to a cluster
idx, _ = vq(data, centroids)

# ignore this, just labelling each point in cluster
# (`labels` was not defined in the original post; numbering the rows 1..N is an assumption)
labels = range(1, len(data) + 1)
for label, x, y in zip(labels, data[:, 0], data[:, 1]):
    plt.annotate(
        str(label),
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))

# some plotting using numpy's logical indexing
plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'or')
plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
show()

I am trying to find the indices for all of the points within each cluster.

[Image: the plot without labels]

You already have that...

plot(data[idx==0,0],data[idx==0,1],'ob',
     data[idx==1,0],data[idx==1,1],'or')

Guess what idx does, and what data[idx==0] vs. data[idx==1] contain.

In this line:

idx,_ = vq(data,centroids)

you have already generated a vector containing the index of the nearest centroid for each point (row) in your data array.
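To make that concrete, here is a minimal sketch with hand-picked values (the four points and two centroids below are made up for illustration and are not from the question). It shows that vq returns one centroid index per row, along with the distance to that centroid:

```python
import numpy as np
from scipy.cluster.vq import vq

# Four illustrative points and two illustrative centroids (assumed values).
points = np.array([[0.0, 0.0],
                   [1.0, 1.0],
                   [0.1, 0.0],
                   [0.9, 1.0]])
centroids = np.array([[0.0, 0.0],
                      [1.0, 1.0]])

# idx[k] is the row number in `centroids` closest to points[k];
# dist[k] is the Euclidean distance from points[k] to that centroid.
idx, dist = vq(points, centroids)
print(idx)
print(dist)
```

Rows 0 and 2 sit next to the first centroid and rows 1 and 3 next to the second, so idx holds one entry per row of points, each naming a row of centroids.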

It seems you want the row indices of all of the points that are nearest to centroid 0, centroid 1 etc. You can use np.nonzero to find the indices where idx == i, where i is the centroid you are interested in.

For example:

import numpy as np

# row indices of the points assigned to centroid 0 and to centroid 1
in_0 = np.nonzero(idx == 0)[0]
in_1 = np.nonzero(idx == 1)[0]
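If there are more than two clusters, the same idea can be generalised. The following is only a sketch: the dictionary name members is hypothetical, the data is regenerated here with a seeded generator so the block runs on its own, and np.flatnonzero(mask) is used as a shorthand for np.nonzero(mask)[0].

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Regenerate data similar to the question's (seeded only so the sketch is self-contained).
rng = np.random.default_rng(0)
data = np.vstack((rng.random((150, 2)) + 0.5, rng.random((150, 2))))

centroids, _ = kmeans(data, 2)
idx, _ = vq(data, centroids)

# Map each centroid number to the row indices of the points assigned to it.
# `members` is a hypothetical name chosen for this example.
members = {i: np.flatnonzero(idx == i) for i in range(len(centroids))}

for i, rows in members.items():
    print(f"centroid {i}: {len(rows)} points, first rows {rows[:5]}")
```

Each row of data appears in exactly one entry of members, so data[members[i]] gives the same points as data[idx == i].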

In the comments you also asked why the idx vector differs across runs. This is because if you pass an integer as the second parameter to kmeans, the centroid locations are randomly initialized (see the scipy.cluster.vq.kmeans documentation).
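If you need the same assignments on every run, one option is to pass an array of initial centroids as the second argument instead of an integer, which kmeans treats as the starting guess. The sketch below assumes you can supply a plausible guess; the values in guess and the helper name cluster are illustrative and are not part of the original code.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Fixed data so the comparison below is meaningful (seeded for this sketch only).
rng = np.random.default_rng(0)
data = np.vstack((rng.random((150, 2)) + 0.5, rng.random((150, 2))))

# Illustrative starting centroids (assumed values, roughly one per group).
guess = np.array([[0.5, 0.5],
                  [1.0, 1.0]])

def cluster(points, initial_centroids):
    """Run kmeans from an explicit starting guess, then assign each point."""
    centroids, _ = kmeans(points, initial_centroids.copy())
    idx, _ = vq(points, centroids)
    return centroids, idx

centroids_a, idx_a = cluster(data, guess)
centroids_b, idx_b = cluster(data, guess)

# With a fixed starting guess, both runs follow the same iterations.
print(np.array_equal(idx_a, idx_b))
```

With an explicit guess there is no random initialization, so centroid 0 and centroid 1 keep the same meaning from run to run.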


 