DBSCAN sklearn is very slow

Question

I am trying to cluster a dataset that has more than 1 million data points. One column holds text and the other holds a numeric value corresponding to it. The problem is that the clustering gets stuck and never completes. With smaller datasets of around 100,000 points it runs fairly quickly, but as I increase the number of points it slows down, and at one million it hangs and never finishes. Initially, I thought this was because I have a tf-idf matrix for the text with 100 dimensions, so it takes a long time. Then I tried clustering on the amount alone, which is a single value per data point, and it still did not complete. The code snippet is below. Any idea what I might be doing wrong? I have seen people work with larger datasets without any problem.

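from sklearn.cluster import DBSCAN

# `data` is assumed to be a pandas DataFrame loaded elsewhere
Y = data['amount'].values
Y = Y.reshape(-1, 1)  # DBSCAN expects a 2-D array: one row per point
dbscan = DBSCAN(eps=0.3, min_samples=10, algorithm='kd_tree')
labels = dbscan.fit_predict(Y)  # same array as dbscan.labels_
print(labels.size)
clusters = labels.tolist()
# print each label with its value
for a, b in zip(labels, Y):
    print(a, b)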

Answer 1

Use more cores.

Use the n_jobs parameter: pass n_jobs=-1 to the DBSCAN constructor so that the neighbor search runs on all available processors.

Example:

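from sklearn.cluster import DBSCAN

# `data` is assumed to be a pandas DataFrame loaded elsewhere
Y = data['amount'].values
Y = Y.reshape(-1, 1)  # DBSCAN expects a 2-D array: one row per point
# n_jobs=-1 runs the neighbor search on all available processors
dbscan = DBSCAN(eps=0.3, min_samples=10, algorithm='kd_tree', n_jobs=-1)
labels = dbscan.fit_predict(Y)  # same array as dbscan.labels_
print(labels.size)
clusters = labels.tolist()
# print each label with its value
for a, b in zip(labels, Y):
    print(a, b)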

Answer 2

Most likely your epsilon is too large.

If most points are within epsilon of most other points, then the runtime will be quadratic, O(n²). So begin with small values!
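
To check whether this applies to a single numeric column such as the amount, one rough diagnostic is to count how many points fall within epsilon of each point. The sketch below is only an illustration using the standard library, not what scikit-learn does internally; the function name `mean_neighbors_within_eps` and the sample values are made up for the example.

```python
from bisect import bisect_left, bisect_right


def mean_neighbors_within_eps(values, eps):
    """Average number of points (the point itself included) within eps of each value.

    1-D only: after sorting, the eps-neighborhood of v is the contiguous
    range [v - eps, v + eps], which two binary searches can count.
    """
    ordered = sorted(values)
    total = 0
    for v in ordered:
        lo = bisect_left(ordered, v - eps)   # first index with value >= v - eps
        hi = bisect_right(ordered, v + eps)  # first index with value > v + eps
        total += hi - lo
    return total / len(ordered)


# Hypothetical amounts: three close values and one far away.
amounts = [0.0, 0.1, 0.2, 5.0]
avg = mean_neighbors_within_eps(amounts, eps=0.3)
# Second number: fraction of the dataset that a typical point sees as neighbors.
print(avg, avg / len(amounts))
```

If the printed fraction stays close to 1 on a sample of the real data, almost every point is a neighbor of almost every other point, so the neighbor lists for n points hold on the order of n² entries. For a million points that is about 10^12 entries, which would look exactly like a run that hangs.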

You can't just add or remove features and leave epsilon unchanged, because distances between points change scale when the feature set changes.
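
A small experiment can make this concrete. The sketch below computes, by brute force on a small sample, the distance from each point to its k-th nearest neighbor, which is a commonly used heuristic for picking epsilon (with k near min_samples). The helper `k_distances` and the random sample are hypothetical and only meant for illustration; on real data you would run it on a random subset.

```python
import math
import random


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (brute force)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


random.seed(0)
# Hypothetical sample: 200 points with 1 feature, then the same points
# with 9 extra random features appended.
one_feature = [(random.random(),) for _ in range(200)]
ten_features = [p + tuple(random.random() for _ in range(9)) for p in one_feature]

for name, sample in [("1 feature", one_feature), ("10 features", ten_features)]:
    kd = k_distances(sample, k=10)
    print(name, "median 10th-neighbor distance:", round(kd[len(kd) // 2], 3))
```

Adding features can only increase Euclidean distances, so an epsilon that was reasonable for 100 tf-idf dimensions may cover nearly the whole dataset when only the amount column is left, and the reverse also holds.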

Answer 1: 1 2018-09-29 21:55:42
Answer 2: 0 2018-09-30 06:00:14
