
KMeans clustering unbalanced data

I have a data set with 50 features (c1, c2, c3, ...) and over 80k rows.

Each row contains normalised numerical values (ranging from 0 to 1). They are actually normalised dummy variables, so some rows have only a few features, 3-4 (i.e. 0 is assigned where there is no value). Most rows have about 10-20 features.

I used KMeans to cluster the data, and it always results in one cluster with a very large number of members. Upon analysis, I noticed that rows with fewer than 4 features tend to get clustered together, which is not what I want.

Is there any way to balance out the clusters?

Producing balanced clusters is not part of the k-means objective. In fact, solutions with balanced clusters can be arbitrarily bad (just consider a dataset with duplicates). K-means minimizes the within-cluster sum of squares, and putting the rows with few features into one cluster appears to reduce it.
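To make the duplicates argument concrete, here is a minimal sketch in plain Python. The toy data (six copies of 0.0 and two copies of 10.0) and the helper name `sse` are assumptions made for illustration, not part of the original question.

```python
def sse(clusters):
    """Within-cluster sum of squared distances to each cluster mean (1-D points)."""
    total = 0.0
    for points in clusters:
        mean = sum(points) / len(points)
        total += sum((p - mean) ** 2 for p in points)
    return total

# Toy data (assumed): six duplicates at 0.0 and two duplicates at 10.0, k = 2.
unbalanced = [[0.0] * 6, [10.0] * 2]            # cluster sizes 6 and 2
balanced = [[0.0] * 4, [0.0, 0.0, 10.0, 10.0]]  # cluster sizes 4 and 4

sse_unbalanced = sse(unbalanced)
sse_balanced = sse(balanced)
print(sse_unbalanced, sse_balanced)
```

The unbalanced split has a sum of squares of zero, while forcing equal sizes mixes the two groups and raises it. Moving the second group further away makes the balanced solution as bad as you like.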

What you see is the typical effect of using k-means on sparse, non-continuous data. Encoded categorical variables, binary variables, and sparse data are just not well suited to k-means, because it relies on means. Furthermore, you'd probably need to weight the variables carefully, too.
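One way to see why rows with few features end up together: in Euclidean distance they all sit close to the origin, and therefore close to each other, even when they share no feature at all. The sketch below assumes 0/1 rows in 50 dimensions; the `row` helper and the chosen indices are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def row(n_features, active):
    """Hypothetical helper: a 0/1 row with ones at the given indices."""
    return [1.0 if i in active else 0.0 for i in range(n_features)]

N = 50
# Two sparse rows with 3 features each and no feature in common.
sparse_a = row(N, {0, 1, 2})
sparse_b = row(N, {3, 4, 5})
# Two denser rows with 15 features each, also with no feature in common.
dense_a = row(N, set(range(10, 25)))
dense_b = row(N, set(range(25, 40)))

d_sparse = euclidean(sparse_a, sparse_b)  # sqrt(6)
d_dense = euclidean(dense_a, dense_b)     # sqrt(30)
print(d_sparse, d_dense)
```

Both pairs are completely dissimilar in terms of shared features, yet the sparse pair is much closer, so a sum-of-squares objective is happy to group such rows.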

Now, a hotfix that will likely improve your results (at least the perceived quality; I do not think it makes them statistically any better) is to normalize each vector to unit length (Euclidean norm 1). This will emphasize the ones in rows with few nonzero entries. You'll probably like the results more, but they are even harder to interpret.
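A minimal sketch of that normalization in plain Python follows. The function name `l2_normalize` and the two example rows are assumptions for illustration, and leaving all-zero rows unchanged is a choice made here, not something the answer prescribes.

```python
import math

def l2_normalize(row):
    """Scale a row to Euclidean norm 1; all-zero rows are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in row))
    if norm == 0.0:
        return list(row)
    return [x / norm for x in row]

# Assumed example rows with 50 features each.
sparse_row = [1.0] * 3 + [0.0] * 47   # 3 nonzero entries
dense_row = [1.0] * 15 + [0.0] * 35   # 15 nonzero entries

sparse_unit = l2_normalize(sparse_row)
dense_unit = l2_normalize(dense_row)

# After scaling, each nonzero entry of the sparse row is larger than
# each nonzero entry of the dense row.
print(max(sparse_unit), max(dense_unit))
```

The normalized rows can then be passed to whatever k-means implementation you already use. Every row now has the same length, so rows with few features are no longer all crowded near the origin.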


 