简体   繁体   English

在KMeans聚类之后查找聚类的长度(与聚类关联的点数)(scikit learn)

[英]Find length of cluster (how many point associated with cluster) after KMeans clustering (scikit learn)

I have done clustering using Kmeans using sklearn. 我使用sklearn使用Kmeans进行了聚类。 While it has a method to print the centroids, I am finding it rather bizzare that scikit-learn doesn't have a method to find out the cluster length (or that I have not seen it so far). 虽然它有一种打印质心的方法,但我觉得scikit-learn没有找到簇长度的方法(或者到目前为止我还没有看到它)。 Is there a neat way to get the cluster-length of each cluster or many points associated with cluster? 是否有一种巧妙的方法来获取每个集群的集群长度或与集群相关的许多点? I currently have this rather cludgy code to do it where I am finding cluster of length one and need to add other point to this cluster by measuring the Euclidean distance between the points and have to update the labels 我目前有这个相当繁琐的代码,我发现它的长度为1的簇,需要通过测量点之间的欧几里德距离来添加其他点,并且必须更新标签

import numpy as np
from clustering.clusternew import Kmeans_clu
from evolution.generate import reproduction
from mapping.somnew import mapping, no_of_neurons, neuron_weights_init
from population_creation.population import pop_create
from New_SOL import newsol


data = genfromtxt('iris.csv', delimiter=',', skip_header=0, usecols=range(0, 4)) ##Read the input data
actual_label = genfromtxt('iris.csv', delimiter=',', dtype=str,skip_header=0, usecols=(4))
chromosome = int(input("Enter the number of chromosomes: "))  #Input the population size
max_gen = int(input("Enter the maximum number of generation: "))  #Input the maximum number of generation

for i in range(0, chromosome):
    cluster = 3#random.randint(2, max_cluster)  ##Randomly selects cluster number from 2 to root(poplation)
    K.insert(i, cluster)  ##Store the number of clusters in clu
    print('value of K is ',K)
    u, label,z1,A1= Kmeans_clu(cluster, data)
    #print("centers and labels : ", u, label)
    lab.insert(i, label)  ##Store the labels in lab
    center.insert(i, u) 
    new_center = pop_create(max_cluster, features, cluster, u)
    population.insert(i, new_center)
    print("VAlue of population in main\n" ,population)

newsol(max_gen,population,data)

For newsol method we pass the new population from the above method generated code and again doing K-Means on the population 对于newsol方法,我们从上面的方法生成的代码传递新的种群,并再次对种群进行K-Means

def ClusterIndicesComp(clustNum, labels_array):   #list comprehension for accessing the features in iris data set 
    return np.array([i for i, x in enumerate(labels_array) if x == clustNum])

def newsol(max_gen,population,data):
    #print('VAlue of NewSol Population is',population)
    for i in range(max_gen):
        cluster1=5
        u,label,t,l=Kmeans_clu(cluster1, population)
        A1.insert(i,t)
        plab.insert(i,label)
        pcenter.insert(i,u)
        k2=Counter(l.labels_)  #Count number of elements in each cluster
        k1=[t for (t, v) in k2.items() if v == 1] #element whose length is one will be fetched 
        t1= np.array(k1) #Iterating through the cluster which have one point associated with them 
        for b in range(len(t1)):
            print("Value in NEW_SOL is of 1 length cluster\n",t1[b])
            plot1=data[ClusterIndicesComp(t1[b], l.labels_)]
            print("Values are in sol of plot1",plot1)
            for q in range(cluster1):
                plot2=data[ClusterIndicesComp(q, l.labels_)]
                print("VAlue of plot2 is for \n",q,plot2)
                for i in range(len(plot2)):#To get one element at a time from plot2
                    plotk=plot2[i]
                    if([t for (t, v) in k2.items() if v >2]):#checking if the cluster have more than 2 points than only the distance will be calculated 
                        S=np.linalg.norm(np.array(plot1) - np.array(plotk))
                        print("Distance between plot1 and plotk is",plot1,plotk,np.linalg.norm(np.array(plot1) - np.array(plotk)))#euclidian distance is calculated 
                    else:
                        print("NO distance between them\n")

Kmeans which I have done is 我做过的Kmeans是

from sklearn.cluster import KMeans
import numpy as np 

def Kmeans_clu(K, data):

    kmeans = KMeans(n_clusters=K, init='random', max_iter=1, n_init=1).fit(data) ##Apply k-means clustering
    labels = kmeans.labels_
    clu_centres = kmeans.cluster_centers_
    z={i: np.where(kmeans.labels_ == i)[0] for i in range(kmeans.n_clusters)} #getting cluster for each label 

    return clu_centres, labels ,z,kmeans

For getting number of instances in each cluster may be you can try using Counter : 为了获得每个集群中的实例数量,您可以尝试使用Counter

from collections import Counter, defaultdict
print(Counter(estimator.labels_))

Result: 结果:

Counter({0: 62, 1: 50, 2: 38})

where cluster 0 has 62 instances, cluster 1 has 50 instances, and cluster 2 has 38 instances 其中集群0有62个实例,集群1有50个实例,集群2有38个实例

And may be to store index of instances of each clusters, you can use defaultdict : 并且可能是存储每个集群的实例索引,可以使用defaultdict

clusters_indices = defaultdict(list)
for index, c  in enumerate(estimator.labels_):
    clusters_indices[c].append(index)

Now, to find indices of instances in cluster 0, calling: 现在,要查找集群0中的实例索引,请调用:

print(clusters_indices[0])

Result: 结果:

[50, 51, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 
 71, 72, 73, 74, 75, 76, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92,
 93, 94, 95, 96, 97, 98, 99, 101, 106, 113, 114, 119, 121, 123, 126, 127, 133, 138, 142, 146, 149]

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM