Different silhouette scores for the same data and number of clusters

Question

I would like to choose an optimal number of clusters for my dataset using the silhouette score. My dataset contains information about 2,000+ brands, including the number of customers who purchased each brand, the brand's sales, and the number of goods the brand sells under each category.

Since my dataset is quite sparse, I've used MaxAbsScaler and TruncatedSVD before clustering.
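For readers who want to reproduce this kind of preprocessing, here is a minimal sketch of MaxAbsScaler followed by TruncatedSVD on a sparse matrix. The random matrix, its shape, and n_components=50 are assumptions for illustration only; the question does not state the actual dimensions or settings.

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical stand-in for the brand matrix: 2,000 rows, 300 sparse features.
X = sparse.random(2000, 300, density=0.02, format="csr", random_state=0)

# MaxAbsScaler divides each column by its maximum absolute value and keeps
# the matrix sparse; TruncatedSVD then reduces dimensionality without centering.
reducer = make_pipeline(
    MaxAbsScaler(),
    TruncatedSVD(n_components=50, random_state=0),  # 50 is an assumed value
)
X_reduced = reducer.fit_transform(X)

print(X_reduced.shape)
```

Both steps accept sparse input directly, which is presumably why they were chosen over StandardScaler and PCA here.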

The clustering method I use is k-means, since it's the one I'm most familiar with (I would appreciate suggestions on other clustering methods).

When I set the number of clusters to 80 and run k-means, I get a different silhouette score each time. Is it because k-means gives different clusters each time? Sometimes the silhouette score for 80 clusters is lower than the score for 200 clusters, and sometimes it's the opposite. So I'm confused about how to choose a reasonable number of clusters.

Besides, the range of my silhouette scores is quite small, from 0.15 to 0.2, and it doesn't change much as I increase the number of clusters.

Here are the silhouette scores I got:

For n_clusters=80, The Silhouette Coefficient is 0.17329035592930178
For n_clusters=100, The Silhouette Coefficient is 0.16970208098407866
For n_clusters=200, The Silhouette Coefficient is 0.1961679920561574
For n_clusters=300, The Silhouette Coefficient is 0.19367019831221857
For n_clusters=400, The Silhouette Coefficient is 0.19818865972762675
For n_clusters=500, The Silhouette Coefficient is 0.19551544844885604
For n_clusters=600, The Silhouette Coefficient is 0.19611760638136203

I would much appreciate your suggestions! Thanks in advance!

Answer 1

Yes, k-means is randomized, so it doesn't always give the same result.
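A small sketch of that effect, assuming scikit-learn and synthetic data from make_blobs (not the asker's dataset): the same k is fitted with different seeds, and the silhouette score is computed for each run. n_init=1 is set on purpose so that each run uses a single random initialization and the seed dependence is visible.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption for illustration).
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=3.0, random_state=0)

scores = []
for seed in range(5):
    # Same data, same k; only the random initialization changes.
    km = KMeans(n_clusters=20, n_init=1, random_state=seed).fit(X)
    scores.append(silhouette_score(X, km.labels_))

print([round(s, 4) for s in scores])
print("spread:", max(scores) - min(scores))
```

Raising n_init makes each fit keep the best of several initializations, which typically narrows the spread but does not remove the randomness.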

Usually, that instability means this k is not a good choice.

But don't blindly rely on silhouette. It's not reliable enough to find the "best" k, largely because there usually is no best k at all.
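One way to see how much weight a silhouette difference can bear is to repeat each k over several seeds and compare the mean against the run-to-run standard deviation. The sketch below assumes scikit-learn and synthetic data; silhouette_stats is a hypothetical helper name, and the k values are arbitrary.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (an assumption for illustration).
X, _ = make_blobs(n_samples=400, centers=6, cluster_std=2.5, random_state=1)


def silhouette_stats(X, k, seeds=range(5)):
    """Mean and standard deviation of the silhouette score over several seeds."""
    scores = [
        silhouette_score(
            X, KMeans(n_clusters=k, n_init=1, random_state=s).fit_predict(X)
        )
        for s in seeds
    ]
    return float(np.mean(scores)), float(np.std(scores))


for k in (4, 6, 10, 20):
    mean, std = silhouette_stats(X, k)
    print(f"k={k:>2}: mean={mean:.3f}, std={std:.3f}")
```

If the gap between two k values is no larger than their standard deviations, the ranking between them is probably noise rather than evidence for either k.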

Look at the data, and use your understanding to choose a good clustering instead. Don't expect anything good to come out automatically.

Answer 2

I think you are using sklearn, so setting the random_state parameter to a fixed number should give you reproducible results across runs of k-means for the same k. You can set it to 0, 42, or whatever you want; just keep the same number across runs of your code and the results will be the same.
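A minimal sketch of this, assuming scikit-learn's KMeans and synthetic data (the sample size, k=8, and seed 42 are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (an assumption for illustration).
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# Two independent fits with the same random_state use the same initializations.
labels_a = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(X)
labels_b = KMeans(n_clusters=8, n_init=10, random_state=42).fit_predict(X)

print(np.array_equal(labels_a, labels_b))
```

Note that fixing the seed only makes a run repeatable; it does not make that particular clustering better than the ones other seeds would produce.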

Answer 1
2 2017-08-31 19:23:55

Answer 2
0 2017-09-06 02:58:29