How to Cluster Multidimensional and Unknown Data using KMeans?

I have two questions regarding KMeans clustering in Python.

I have an auto-generated dataset called Mystery.npy with shape (30309, 784). I am trying to apply KMeans clustering to it, but I get the following error:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

Do you have any idea how to overcome this error, or how to cluster such data with the KMeans method?

Second question: is there a way in code to find out what type of data I have?

Your assistance is highly appreciated. Thanks,

@Nael Alsaleh, you can run K-Means the following way:

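import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Load the data; the question names the file Mystery.npy
X = np.load('Mystery.npy')

# Fit KMeans for k = 1..10 and record the inertia
# (within-cluster sum of squared distances) for each k
wx = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X)
    wx.append(kmeans.inertia_)

plt.plot(range(1, 11), wx)
plt.xlabel('Number of clusters')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()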

[Figure: elbow curve for 1 to 10 clusters]

Note that X is a numpy array. This code produces the elbow curve, from which you can choose a suitable number of clusters: in this case, 5 or 6.

If you are working with numpy, you will have an array:

array([0.86992608, 0.11252552, 0.25573737, ..., 0.32652233, 0.14927118,
        0.1662449 ])

You may also be working with a list,

[0.86992608, 0.11252552, 0.25573737, ..., 0.32652233, 0.14927118,
        0.1662449 ]

which you will need to convert to an array with np.array(X), or you may even have a Pandas DataFrame.

You can check the column types in a Pandas DataFrame with:

import pandas as pd
pd.DataFrame(X).dtypes

In numpy, use X.dtype.

After converting the data to an array, run:
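As a rough illustration of the type checks mentioned above, the sketch below gathers the Python type, shape, and dtype of an array in one place. The helper name `describe` and the random stand-in data are assumptions made for this example; they are not part of the original answer or of the real Mystery.npy file.

```python
import numpy as np

# Synthetic stand-in for the real file (assumption); it only mirrors
# the (n_samples, 784) layout described in the question.
rng = np.random.default_rng(0)
X = rng.random((100, 784))

def describe(data):
    """Return basic facts about an array-like object (hypothetical helper)."""
    arr = np.asarray(data)
    is_float = np.issubdtype(arr.dtype, np.floating)
    return {
        "python_type": type(data).__name__,
        "shape": arr.shape,
        "ndim": arr.ndim,
        "dtype": str(arr.dtype),
        # NaN checks only make sense for floating-point data
        "has_nan": bool(np.isnan(arr).any()) if is_float else False,
    }

info = describe(X)
print(info)
```

With the real data you would replace the synthetic array with the result of np.load and inspect the same fields.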

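n = 5
# fit_predict fits the model and returns the labels in one step,
# so a separate .fit(X) call beforehand is not needed
kmeans = KMeans(n_clusters=n, random_state=20)
labels_of_clusters = kmeans.fit_predict(X)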

This gives you the index of the cluster that each example belongs to.

array([1, 4, 0, 0, 4, 1, 4, 0, 2, 0, 0, 4, 3, 1, 4, 2, 2, 3, 0, 1, 1, 0,
       4, 4, 2, 0, 3, 0, 3, 1, 1, 2, 1, 0, 2, 4, 0, 3, 2, 1, 1, 2, 2, 2,
       2, 0, 0, 4, 1, 3, 1, 0, 1, 4, 1, 0, 0, 0, 2, 0, 1, 2, 2, 1, 2, 2,
       0, 4, 4, 4, 4, 3, 1, 2, 1, 2, 2, 1, 1, 3, 4, 3, 3, 1, 0, 1, 2, 2,
       1, 2, 3, 1, 3, 3, 4, 2, 2, 0, 2, 1, 3, 4, 2, 0, 2, 1, 3, 3, 3, 4,
       3, 1, 4, 4, 4, 2, 0, 3, 2, 0, 1, 2, 2, 0, 3, 1, 1, 1, 4, 0, 2, 2,
       0, 0, 1, 1, 0, 3, 0, 2, 2, 1, 2, 2, 4, 0, 1, 0, 3, 1, 4, 4, 0, 4,
       1, 2, 0, 2, 4, 0, 1, 2, 3, 1, 1, 0, 3, 2, 4, 0, 1, 3, 1, 2, 4, 3,
       1, 1, 2, 0, 0, 2, 3, 1, 3, 4, 1, 2, 2, 0, 2, 1, 4, 3, 1, 0, 3, 2,
       4, 1, 4, 1, 4, 4, 0, 4, 4, 3, 1, 3, 4, 0, 4, 2, 1, 1, 3, 4, 0, 4,
       4, 4, 4, 2, 4, 2, 3, 4, 3, 3, 1, 1, 4, 2, 3, 0, 2, 4])
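A label array like the one above is easier to read once it is summarized. The following is a minimal sketch, assuming five clusters and using only the first ten labels of the output above as sample input, that counts how many examples fall into each cluster with np.bincount.

```python
import numpy as np

# First ten labels from the output above, used here as sample input
labels = np.array([1, 4, 0, 0, 4, 1, 4, 0, 2, 0])

# minlength makes sure clusters with no members still get a zero count
counts = np.bincount(labels, minlength=5)

for cluster_id, size in enumerate(counts):
    print(f"cluster {cluster_id}: {size} examples")
```

Very uneven counts can be a hint that the chosen number of clusters does not suit the data well.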

To visualize:

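import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# make_blobs lives in sklearn.datasets; the old
# sklearn.datasets.samples_generator module was removed in scikit-learn 0.24
from sklearn.datasets import make_blobs

# Generate a 2-D toy dataset with 4 blobs
X, y_true = make_blobs(n_samples=200, centers=4,
                       cluster_std=0.60, random_state=0)

# Fit and label in one step
kmeans = KMeans(n_clusters=4, random_state=0)
cc = kmeans.fit_predict(X)

# Color each point by its assigned cluster
plt.scatter(X[:, 0], X[:, 1], c=cc, s=50, cmap='viridis')
plt.show()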

[Figure: K-Means scatter plot]
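The scatter plot above works because the toy data has only two features. For data with 784 columns, one common option (an addition here, not something the original answer does) is to project the data to two dimensions with PCA before plotting. The sketch below uses random stand-in data and an arbitrary choice of 5 clusters, both of which are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Synthetic stand-in for the 784-dimensional data (assumption)
rng = np.random.default_rng(0)
X = rng.random((300, 784))

# Cluster in the original 784-dimensional space
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)

# Project to 2-D for display only; the clustering itself is unchanged
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)

print(X_2d.shape, labels.shape)
```

The projected points can then be passed to plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=50, cmap='viridis') in the same way as the toy example. Clusters that are well separated in 784 dimensions may still overlap in the 2-D view.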

What you want to do can be done with scikit-learn's KMeans class. Here is a working example using your data:

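import numpy as np
from sklearn.cluster import KMeans

# Load your data from the .npy file
mystery = np.load('Mystery.npy')

# n_clusters is a hyperparameter set by you.
# The n_jobs argument was removed from KMeans in scikit-learn 1.0,
# so it is not passed here.
kmeans = KMeans(n_clusters=42).fit(mystery[:1000])

# Assign the next 200 instances to the fitted clusters
pred = kmeans.predict(mystery[1000:1200])
print(pred)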
array([36, 16, 21, 15, 15,  0,  5,  7, 31, 33, 10, 14,  1, 36, 30, 22, 12,
        1, 35, 12, 16, 12, 28, 14, 13, 15,  2, 21, 36,  7,  7,  4, 39,  4,
        4, 18,  5, 31, 17,  2,  2, 26, 38, 34, 34, 36, 13, 13, 26,  1, 26,
        8, 38,  0, 38, 34,  0, 21, 36, 12, 16, 38, 23, 15,  0,  6, 34,  0,
       19,  7,  8, 21, 16, 36, 24,  0,  4, 22, 33, 21, 12, 12,  2, 10, 23,
        2,  3,  0, 12,  0, 24, 21, 12, 33,  4, 14, 34, 10, 21,  0, 33, 26,
       36,  2, 12, 34, 29, 27, 33,  3, 12, 12, 15, 39, 34, 26, 26, 16,  8,
        2, 12,  0, 21, 15, 40, 16, 38, 22, 26, 36, 17,  3, 12,  3, 23, 39,
       34, 36, 33, 38, 15, 21,  7, 34, 23, 33, 34, 33, 26, 34, 26, 30, 16,
        2,  3,  0, 33, 34, 39, 12,  5, 34, 26, 33, 30, 39, 12,  2, 15, 29,
       12, 38, 36, 10, 36, 28,  1, 19, 12, 17, 32, 35, 11, 16, 28, 18, 14,
       15, 31, 34, 19,  0, 17, 12, 11, 39, 18, 26, 31,  0], dtype=int32)

If you want to use the full data set, kmeans.fit(mystery) may take some time. For testing purposes I used only the first 1000 instances and predicted the following 200 instances.
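If fitting on the full data set is too slow, scikit-learn's MiniBatchKMeans is one alternative worth trying; it is a swapped-in technique, not what the answer above uses. It updates the centroids from small random batches, which usually trades a little cluster quality for speed. The sketch below runs on random stand-in data, and the values for n_clusters and batch_size are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Synthetic stand-in for the real data (assumption)
rng = np.random.default_rng(0)
X = rng.random((2000, 784))

# batch_size controls how many samples are used per centroid update
mbk = MiniBatchKMeans(n_clusters=10, batch_size=256, n_init=3, random_state=0)
labels = mbk.fit_predict(X)

print(labels.shape, mbk.cluster_centers_.shape)
```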

Do you have any idea how to overcome this error, or how to cluster such data with the KMeans method?

The first question is not really related to scikit-learn. You can find some explanations here: "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()"

and here: "ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all(): Silhouette performance algorithm"

You probably have some syntax problems in your own code...
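Since the question does not show the code that failed, the exact cause cannot be pinned down here. As a general illustration, this error appears whenever a multi-element numpy array is used where Python expects a single True or False, for example in an if statement. The sketch below reproduces it on a small made-up array and shows the two explicit alternatives named in the message.

```python
import numpy as np

a = np.array([1, 2, 3])

# (a > 1) is an array of three booleans, so Python cannot reduce it
# to one truth value and numpy raises ValueError
try:
    if a > 1:
        pass
    raised = False
except ValueError as exc:
    raised = True
    message = str(exc)
    print(message)

# State the intent explicitly instead
any_above = (a > 1).any()   # True if at least one element is > 1
all_above = (a > 1).all()   # True only if every element is > 1
print(any_above, all_above)
```

The same applies to expressions combined with `and` or `or`; for element-wise logic on arrays use `&` and `|` with parentheses around each comparison.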

Second question: is there a way in code to find out what type of data I have?

For how to handle .npy files, you can check here: "What is the way data is stored in *.npy?"
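As a small illustration of how .npy files behave, the sketch below saves an array with np.save and reads it back with np.load. The file name example.npy and the array contents are made up for this example. The point is that shape and dtype are stored in the file, so they come back unchanged.

```python
import os
import tempfile

import numpy as np

original = np.arange(12, dtype=np.float32).reshape(3, 4)

# Write to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.npy")  # hypothetical file name
    np.save(path, original)
    loaded = np.load(path)

print(type(loaded).__name__, loaded.shape, loaded.dtype)
```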
