Finding the indices of all points corresponding to a particular centroid using kmeans clustering
Here is a simple implementation of kmeans clustering (with the points in the cluster labelled from 1 to 500):
from pylab import plot, show
import matplotlib.pyplot as plt
from numpy import vstack, array
from numpy.random import rand
from scipy.cluster.vq import kmeans, vq

# data generation: two overlapping groups of 150 points each
data = vstack((rand(150, 2) + array([.5, .5]), rand(150, 2)))

# computing K-Means with K = 2 (2 clusters)
centroids, _ = kmeans(data, 2)

# assign each sample to a cluster
idx, _ = vq(data, centroids)

# ignore this, just labelling each point in cluster
# (labels was never defined in the post; 1-based row numbers are assumed here)
labels = range(1, len(data) + 1)
for label, x, y in zip(labels, data[:, 0], data[:, 1]):
    plt.annotate(
        str(label),
        xy=(x, y), xytext=(-20, 20),
        textcoords='offset points', ha='right', va='bottom',
        bbox=dict(boxstyle='round,pad=0.5', fc='yellow', alpha=0.5),
        arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0'))

# some plotting using numpy's logical indexing
plot(data[idx == 0, 0], data[idx == 0, 1], 'ob',
     data[idx == 1, 0], data[idx == 1, 1], 'or')
plot(centroids[:, 0], centroids[:, 1], 'sg', markersize=8)
show()
I am trying to find the indices for all of the points within each cluster.
You already have that...
plot(data[idx==0,0],data[idx==0,1],'ob',
data[idx==1,0],data[idx==1,1],'or')
Guess what idx does, and what data[idx==0] vs. data[idx==1] contain.
In this line:
idx,_ = vq(data,centroids)
you have already generated a vector containing the index of the nearest centroid for each point (row) in your data array.
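To make the meaning of that vector concrete, here is a minimal sketch on a toy input. The names points, toy_centroids, codes and dists are made up for illustration and are not part of the question's code.

```python
import numpy as np
from scipy.cluster.vq import vq

# Hypothetical toy data: four 2-D points and two fixed centroids.
points = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
toy_centroids = np.array([[0.0, 0.0], [1.0, 1.0]])

# codes[i] is the row of toy_centroids nearest to points[i];
# dists[i] is the Euclidean distance from points[i] to that centroid.
codes, dists = vq(points, toy_centroids)
print(codes)  # [0 0 1 1]
```

The first two points sit next to centroid 0 and the last two next to centroid 1, so the vector has one entry per row of the input, in the same order.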
It seems you want the row indices of all of the points that are nearest to centroid 0, centroid 1 etc. You can use np.nonzero to find the indices where idx == i, where i is the centroid you are interested in.
For example:
import numpy as np
in_0 = np.nonzero(idx == 0)[0]
in_1 = np.nonzero(idx == 1)[0]
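If there are more than two centroids, the same idea can be written as a loop. The sketch below uses a made-up assignment vector, idx_example, standing in for the idx returned by vq, and np.flatnonzero, which for a 1-D array gives the same result as np.nonzero(...)[0].

```python
import numpy as np

# Hypothetical assignment vector standing in for the idx returned by vq.
idx_example = np.array([0, 1, 1, 0, 2, 1])

# Map each centroid number to the row indices of the points assigned to it.
members = {
    int(i): np.flatnonzero(idx_example == i)
    for i in np.unique(idx_example)
}

for centroid, rows in members.items():
    print(centroid, rows)
```

Here members[1] holds the rows 1, 2 and 5, which are exactly the positions where idx_example equals 1.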
In the comments you also asked why the idx vector differs across runs. This is because if you pass an integer as the second parameter to kmeans, the centroid locations are randomly initialized (see the SciPy documentation for kmeans).
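One way to get the same assignment on every run is to pass an array of initial centroids as the second parameter instead of an integer. This is a sketch under assumptions: the starting guesses in init and the seeded toy data in sample are made up for illustration.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

# Seeded generator, used only so the toy data itself is reproducible.
rng = np.random.RandomState(0)
sample = np.vstack((rng.rand(150, 2) + 0.5, rng.rand(150, 2)))

# Assumed starting guesses, one row per centroid (not from the original post).
init = np.array([[1.0, 1.0], [0.5, 0.5]])

# With an array as the second argument, kmeans starts from these rows
# instead of picking random observations.
centroids_a, _ = kmeans(sample, init)
centroids_b, _ = kmeans(sample, init)

idx_a, _ = vq(sample, centroids_a)
idx_b, _ = vq(sample, centroids_b)
print(np.array_equal(idx_a, idx_b))
```

Because both calls start from the same guesses, they follow the same update steps and the two assignment vectors match.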