
DBSCAN clustering is not working even on 40k data but working on 10k data using Python and sklearn

I am trying to cluster my dataset. I have 700k rows in my dataset. I took 40k rows from it and tried DBSCAN clustering in Python with sklearn. I ran it on 32 GB of RAM. The algorithm ran the whole night but didn't finish, so I stopped the program manually.

But when I tried with a 10k dataset, it worked.

Is there any limitation of DBSCAN with respect to dataset size?

I used the code below:

from sklearn.cluster import DBSCAN

clustering = DBSCAN().fit(df)
pred_y = clustering.labels_

and also this version:

clustering = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2).fit(df)
pred_y = clustering.labels_

How can I use DBSCAN clustering on my dataset?

[UPDATED]

How many columns are we talking about here? From my experience with DBSCAN, it's the number of columns that impacts performance more than the number of rows. I tend to use PCA before fitting the data into DBSCAN to reduce the dimensionality - this significantly speeds up the clustering process.
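As a rough sketch of that PCA step (the column count, n_components and eps below are assumptions for a toy dataset, not values tuned for your data):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

# Toy stand-in for a wide dataset: 10k rows, 20 columns
X = np.random.random((10000, 20))

# Reduce dimensionality before clustering; n_components=2 is an assumption
X_reduced = PCA(n_components=2).fit_transform(X)

# Cluster the reduced data; eps and min_samples are placeholders
labels = DBSCAN(eps=0.05, min_samples=5).fit_predict(X_reduced)
np.unique(labels)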

Based on the additional info you provided, I did a simple repro (warning: with the current params it will run for a looooong time):

import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic datasets of increasing size, 4 columns each
data_1k = np.random.random((1000, 4))
data_5k = np.random.random((5000, 4))
data_10k = np.random.random((10000, 4))
data_40k = np.random.random((40000, 4))

# Fit the 40k dataset with the parameters from the question
clustering = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2).fit(data_40k)
np.unique(clustering.labels_)

The above code executes in just a few seconds for the 10k dataset, but for 40k it will keep processing for a very long time, and that is caused by the really high value of the eps param. Are you absolutely sure it's the "right" value for your data?
For the example I provided above, simply lowering the eps value (e.g. to 0.08) speeds up the process to just ~3 seconds.
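Continuing the repro above, the change is just the eps argument (0.08 happens to suit this uniform random toy data only; it is not a recommendation for your real dataset):

# Same synthetic 40k data, same parameters, only eps lowered
clustering = DBSCAN(eps=0.08, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2).fit(data_40k)
np.unique(clustering.labels_)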

In case eps=9.7 is truly the required value for your data, I would consider using some scalers or something that would help reduce the value range (and then lower the eps value).
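As a rough sketch of that idea (StandardScaler is just one choice, MinMaxScaler would work similarly, and the eps shown here is a placeholder that would still need tuning on the real data):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

# Hypothetical data with a wide value range that would otherwise push eps up
X = np.random.random((10000, 4)) * 100

# Scale to zero mean / unit variance so pairwise distances shrink
X_scaled = StandardScaler().fit_transform(X)

# eps is a placeholder and must be re-tuned after scaling
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)
np.unique(labels)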
