
Clustering algorithm with upper bound requirement for each cluster size

I need to partition approximately 50,000 points into distinct clusters. There is one requirement: the size of every cluster cannot exceed K. Is there a clustering algorithm that can do this?

Please note that the upper bound K is the same for every cluster, say 100.

Most clustering algorithms can be used to create a tree in which the lowest level is just a single element, either because they naturally work "bottom up" by joining pairs of elements and then groups of joined elements, or because, like K-Means, they can be used to repeatedly split groups into smaller groups.

Once you have a tree, you can decide where to split off subtrees to form your clusters of size <= 100. Pruning an existing tree is often quite easy. Suppose that you want to divide an existing tree so as to minimise the sum of some cost over the clusters you create. You might have:

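// Returns the minimum total cost of covering everything below tree_node
// with clusters of size <= 100, and appends the chosen subtree roots
// to list_of_clusters.
f(tree_node, list_of_clusters)
{
  cost = infinity;
  if (size of tree below tree_node <= 100)
  {
    // The whole subtree is small enough to be a single cluster.
    cost = cost_function(stuff below tree_node);
  }
  temp_list = new List();
  // A leaf cannot be split further, so "splitting" it costs infinity.
  cost_children = (tree_node has no children) ? infinity : 0;
  for (children of tree_node)
  {
    cost_children += f(child, temp_list);
  }
  if (cost_children < cost)
  {
    // Splitting into the children's clusters is cheaper.
    list_of_clusters.add_all(temp_list);
    return cost_children;
  }
  list_of_clusters.add(tree_node);
  return cost;
}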
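As a rough runnable illustration of the same pruning idea, here is a minimal Python sketch. The `Node` class, the `prune` name and the cost function are all assumptions made up for this example; the answer leaves the cost unspecified. The cost used here is the sum of squared deviations plus a fixed per-cluster penalty, because without some penalty a pure within-cluster cost is always minimised by singleton clusters.

```python
import math

class Node:
    """Hypothetical tree node: a leaf holds one value, an internal node holds children."""
    def __init__(self, value=None, children=()):
        self.value = value
        self.children = list(children)

    def leaves(self):
        """All leaf values below this node (recomputed on each call, fine for a sketch)."""
        if not self.children:
            return [self.value]
        return [v for child in self.children for v in child.leaves()]

def penalised_sse(values, penalty=2.0):
    """Illustrative cost (an assumption): squared deviations from the mean plus a per-cluster penalty."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) + penalty

def prune(node, max_size, cost_function, clusters):
    """Append the chosen clusters (lists of leaf values) to `clusters` and return their total cost."""
    members = node.leaves()
    cost = cost_function(members) if len(members) <= max_size else math.inf
    if not node.children:
        # A leaf cannot be split, so it is always kept as its own cluster.
        clusters.append(members)
        return cost
    temp = []
    cost_children = sum(prune(child, max_size, cost_function, temp) for child in node.children)
    if cost_children < cost:
        clusters.extend(temp)
        return cost_children
    clusters.append(members)
    return cost

# Tiny hand-built tree: two tight groups under one root.
tree = Node(children=[
    Node(children=[Node(1.0), Node(2.0)]),
    Node(children=[Node(10.0), Node(11.0), Node(12.0)]),
])
clusters = []
total = prune(tree, max_size=3, cost_function=penalised_sse, clusters=clusters)
print(clusters, total)
```

With `max_size=3` the root (five leaves) is too large to be a cluster, so the recursion has to descend, and each child subtree is kept whole because splitting it into singletons would pay the penalty more often.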

One way is to use hierarchical K-means: keep splitting each cluster that is larger than K until none of them exceeds K.
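A minimal sketch of that splitting loop, in plain Python with no libraries, might look like the following. The function names, the fixed iteration count and the fallback for degenerate splits are assumptions for illustration; in practice you would probably swap in a library K-means for `two_means`.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def two_means(points, rng, iterations=20):
    """Split `points` (at least 2) into two non-empty groups with a basic 2-means loop."""
    centres = rng.sample(points, 2)
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            nearer = 0 if dist2(p, centres[0]) <= dist2(p, centres[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            # Degenerate case (e.g. duplicate points): fall back to an arbitrary halving.
            half = len(points) // 2
            return points[:half], points[half:]
        centres = [mean(groups[0]), mean(groups[1])]
    return groups

def bounded_kmeans(points, max_size, seed=0):
    """Keep bisecting any cluster larger than `max_size` until all clusters fit."""
    rng = random.Random(seed)
    pending, done = [list(points)], []
    while pending:
        cluster = pending.pop()
        if len(cluster) <= max_size:
            done.append(cluster)
        else:
            pending.extend(two_means(cluster, rng))
    return done

# Usage: 500 random 2-D points, at most 100 per cluster.
rng = random.Random(1)
pts = [(rng.random(), rng.random()) for _ in range(500)]
clusters = bounded_kmeans(pts, max_size=100)
print(sorted(len(c) for c in clusters))
```

Every split produces two strictly smaller non-empty groups, so the loop always terminates. The bound is guaranteed, but nothing stops the result from containing some clusters far smaller than K.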

Another approach (in some sense the opposite one) would be to use hierarchical agglomerative clustering, i.e. a bottom-up approach, and again make sure you don't merge two clusters if they would form a new one of size > K.
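The bottom-up variant can be sketched in the same spirit. This is a deliberately naive version that assumes centroid distance as the linkage and rescans all pairs on every merge, so it only suits small inputs; the `stop_dist2` parameter is a hypothetical stopping rule added for the example, not something the answer specifies.

```python
import math

def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(cluster):
    """Coordinate-wise mean of a non-empty cluster."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def bounded_agglomerative(points, max_size, stop_dist2=math.inf):
    """Repeatedly merge the closest pair of clusters whose combined size is <= max_size."""
    clusters = [[p] for p in points]
    while True:
        best = None  # (squared distance, i, j) of the closest permitted pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if len(clusters[i]) + len(clusters[j]) > max_size:
                    continue  # this merge would break the size bound
                d = dist2(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None or best[0] > stop_dist2:
            return clusters
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected

# Usage: six 1-D points in two obvious groups, at most 3 per cluster.
pts = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
clusters = bounded_agglomerative(pts, max_size=3)
print(clusters)
```

Without a stopping rule the loop keeps merging until no permitted merge is left, which can join distant clusters just because they are small; passing a finite `stop_dist2` is one way to avoid that.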

The issue with naive clustering is that you do indeed have to calculate a distance matrix that holds the distance from each member A to every other member in the set. How costly this is depends on whether you have pre-processed the population, or are amalgamating the clusters into typical individuals and then recalculating the distance matrix.
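To put a rough number on that for the 50,000 points in the question, here is a back-of-the-envelope calculation. It assumes each distance is stored once per unordered pair as an 8-byte float; both of those are assumptions about the implementation.

```python
n = 50_000                        # number of points, from the question
pairs = n * (n - 1) // 2          # unordered pairs of distinct points
bytes_needed = pairs * 8          # assuming one 8-byte float per distance
print(pairs)                      # 1249975000
print(bytes_needed / 1e9, "GB")
```

That is roughly 10 GB for the condensed matrix alone, which is why reducing the population first, or clustering representatives rather than raw points, matters at this scale.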
