Different silhouette scores for the same data and number of clusters

I would like to choose an optimal number of clusters for my dataset using the silhouette score. My dataset contains information about 2,000+ brands, including the number of customers who purchased each brand, the brand's sales, and the number of goods the brand sells in each category.

Since my dataset is quite sparse, I've used MaxAbsScaler and TruncatedSVD before clustering.

The clustering method I use is k-means, since it's the one I'm most familiar with (I would appreciate suggestions for other clustering methods).

When I set the number of clusters to 80 and run k-means, I get a different silhouette score each time. Is it because k-means gives different clusters each time? Sometimes the silhouette score for 80 clusters is lower than the score for 200 clusters, and sometimes it's the opposite. So I'm confused about how to choose a reasonable number of clusters.

Besides, the range of my silhouette scores is quite small: they stay between 0.15 and 0.2 and don't change much as I increase the number of clusters.

Here are the results I got from computing the silhouette score:

For n_clusters=80, The Silhouette Coefficient is 0.17329035592930178
For n_clusters=100, The Silhouette Coefficient is 0.16970208098407866
For n_clusters=200, The Silhouette Coefficient is 0.1961679920561574
For n_clusters=300, The Silhouette Coefficient is 0.19367019831221857
For n_clusters=400, The Silhouette Coefficient is 0.19818865972762675
For n_clusters=500, The Silhouette Coefficient is 0.19551544844885604
For n_clusters=600, The Silhouette Coefficient is 0.19611760638136203

I would much appreciate your suggestions! Thanks in advance!

Yes, k-means is randomized, so it doesn't always give the same result.

If the result changes a lot from run to run, that usually means this k is NOT good.
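One way to see how unstable a given k is would be to rerun k-means with several seeds and look at the spread of the silhouette score. The snippet below is only a rough sketch: the data is a synthetic stand-in made with `make_blobs`, and `silhouette_spread`, the seed range, and the k values are made up for illustration, not taken from the question.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the real data (assumption: 500 points, 10 features).
X, _ = make_blobs(n_samples=500, n_features=10, centers=8, random_state=0)


def silhouette_spread(X, k, seeds=range(5)):
    """Return the mean and standard deviation of the silhouette score over several seeds."""
    scores = []
    for seed in seeds:
        # n_init=1 so that each seed corresponds to a single random initialization.
        labels = KMeans(n_clusters=k, n_init=1, random_state=seed).fit_predict(X)
        scores.append(silhouette_score(X, labels))
    return float(np.mean(scores)), float(np.std(scores))


for k in (4, 8, 16, 32):
    mean, std = silhouette_spread(X, k)
    print(f"k={k}: mean silhouette {mean:.3f}, std {std:.3f}")
```

If the standard deviation for one k is about as large as the difference in mean score between two candidate values of k, the silhouette score alone probably can't tell those two apart.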

But don't blindly rely on silhouette. It's not reliable enough to find the "best" k, largely because there usually is no best k at all.

Look at the data, and use your understanding to choose a good clustering instead. Don't expect anything good to come out automatically.

I think you are using sklearn, so setting the random_state parameter to a number should give you reproducible results across different executions of k-means for the same k. You can set that number to 0, 42, or whatever you want; just keep the same number across runs of your code and the results will be the same.
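Here is a minimal sketch of what that could look like with the preprocessing mentioned in the question. The sparse matrix is random placeholder data, and `run`, `SEED`, `n_components=20`, and `k=10` are arbitrary choices for the example. Note that TruncatedSVD uses a randomized solver by default, so it takes a random_state too.

```python
import numpy as np
from scipy import sparse
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import silhouette_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

SEED = 42  # any fixed integer works; what matters is that it stays the same

# Random sparse placeholder for the brand matrix (assumption: 300 x 50, ~10% non-zero).
rng = np.random.default_rng(0)
dense = rng.random((300, 50)) * (rng.random((300, 50)) < 0.1)
X = sparse.csr_matrix(dense)


def run(seed, k=10):
    """Scale, reduce, cluster, and return the silhouette score for one seed."""
    reducer = make_pipeline(
        MaxAbsScaler(),
        TruncatedSVD(n_components=20, random_state=seed),  # the default solver is randomized
    )
    X_reduced = reducer.fit_transform(X)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X_reduced)
    return silhouette_score(X_reduced, labels)


first = run(SEED)
second = run(SEED)
print(first, second)  # the two values should match
```

Fixing the seed only makes a run repeatable. It doesn't make that particular clustering any better than the ones other seeds would give.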
