
DBSCAN clusters of clusters (sklearn python)

I have elements of different categories that need to be clustered separately (according to their category) and then all together. Each element has a location (latitude, longitude).

My goal is to determine the clusters of clusters, where an inner cluster is a group of elements of the same category and an outer cluster is a group of such clusters from different categories, as in the following picture: http://i.imgur.com/V5Dovcf.png

In my case, the distance that decides whether two elements belong in the same cluster is the same distance that decides whether two clusters belong in the same cluster of clusters. Take the blue cluster in the picture, for example: since all the elements in this blue cluster are separated by a distance of at most d (from some element of the cluster), they belong in the blue cluster. The same goes for the red cluster, which includes the elements that are separated by a distance of at most d.

With DBSCAN I can easily find the clusters of all of these elements if I provide all the elements together as input. And if I want to find the clusters of each category, I have to feed in one category at a time and run DBSCAN once per category. But I guess there should be something much faster than running DBSCAN many times to get these clusters of clusters.
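For reference, a minimal sketch of those two passes with scikit-learn; the array names `coords` and `categories` and the eps value are hypothetical:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical data: (latitude, longitude) pairs and a parallel category array.
coords = np.array([[48.86, 2.35], [48.87, 2.36], [48.86, 2.37], [40.71, -74.00]])
categories = np.array(["shop", "shop", "school", "shop"])

# One run over all elements together gives the clusters of clusters.
# eps here is in degrees (Euclidean on lat/lon), purely for illustration.
all_labels = DBSCAN(eps=0.02, min_samples=2).fit_predict(coords)

# One run per category gives the clusters within each category.
per_category_labels = {}
for cat in np.unique(categories):
    mask = categories == cat
    per_category_labels[cat] = DBSCAN(eps=0.02, min_samples=2).fit_predict(coords[mask])
```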

Why do you think it would be faster to mix categories that you want to be separate?

Do the cheap operations first, such as splitting your data set. Then process each partition independently.

As far as I know, scipy cannot accelerate geodetic distances. So you will have to do O(n^2) distance computations. If you have 10 categories, your problem gets 10x faster if you can split it into such partitions and run DBSCAN 10 times, because each run is 10^2 times cheaper!
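A sketch of that split-first approach, reusing the hypothetical `coords`/`categories` arrays from the question. As a side note, scikit-learn's DBSCAN (unlike scipy) can index the haversine metric with a ball tree, with coordinates in radians and eps expressed as distance divided by the Earth's radius, which may avoid materialising a dense distance matrix within each partition:

```python
import numpy as np
from sklearn.cluster import DBSCAN

EARTH_RADIUS_KM = 6371.0
d_km = 1.0  # hypothetical neighbourhood radius d, in kilometres

coords = np.array([[48.86, 2.35], [48.87, 2.36], [48.86, 2.37], [40.71, -74.00]])
categories = np.array(["shop", "shop", "school", "shop"])

# Split first (cheap), then run DBSCAN independently on each partition.
labels = {}
for cat in np.unique(categories):
    part = np.radians(coords[categories == cat])  # haversine expects radians
    db = DBSCAN(eps=d_km / EARTH_RADIUS_KM, min_samples=2,
                metric="haversine", algorithm="ball_tree")
    labels[cat] = db.fit_predict(part)
```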

It seems to me the main problem here is due to the multi-representation or hierarchical nature (categories, and clusters within categories) of your data. Typically, if each distance is based on a single dimension, the two dimensions (say, cluster distance and category distance) can be combined to form a new, single dimension in which the data representation becomes simpler.
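One possible reading of this in DBSCAN terms, sketched with the same hypothetical names as above: fold the category into the distance itself by precomputing a geodetic distance matrix and pushing every cross-category pair far beyond eps, so a single run can never merge elements of different categories. Note this materialises the full O(n^2) matrix that the previous answer warns about:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_KM = 6371.0
d_km = 1.0  # hypothetical neighbourhood radius

coords = np.array([[48.86, 2.35], [48.87, 2.36], [48.86, 2.37], [40.71, -74.00]])
categories = np.array(["shop", "shop", "school", "shop"])

# Precompute geodetic distances, then push cross-category pairs out of reach.
dist = haversine_distances(np.radians(coords)) * EARTH_RADIUS_KM
dist[categories[:, None] != categories[None, :]] = 1e9  # large finite sentinel > eps

labels = DBSCAN(eps=d_km, min_samples=2, metric="precomputed").fit_predict(dist)
```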

Maybe this helps?

Some material I found that may be helpful:
