
DBSCAN sklearn is very slow

I am trying to cluster a dataset that has more than 1 million data points. One column contains text and the other column contains a corresponding numeric value. The problem I am facing is that the clustering gets stuck and never completes. With smaller datasets of around 100,000 points it works fairly quickly, but as I increase the number of points it slows down, and with a million it hangs and never finishes. Initially I thought this was because the text is represented as a tf-idf matrix with 100 dimensions, so it was simply taking a long time. But then I tried clustering on the amount alone, which is just a single value per data point, and it still did not complete. Below is the code snippet. Any idea what I might be doing wrong? I have seen people work with larger datasets without any problem.

from sklearn.cluster import DBSCAN

# cluster on the single 'amount' column
Y = data['amount'].values
Y = Y.reshape(-1, 1)
dbscan = DBSCAN(eps=0.3, min_samples=10, algorithm='kd_tree')
dbscan.fit_predict(Y)
labels = dbscan.labels_
print(labels.size)
clusters = labels.tolist()
# printing each value and its label
for a, b in zip(labels, Y):
    print(a, b)

Use more cores.

Use the n_jobs parameter: pass n_jobs=-1 to the DBSCAN constructor so the neighbor searches run on all available CPU cores.

Example:

from sklearn.cluster import DBSCAN

Y = data['amount'].values
Y = Y.reshape(-1, 1)
# n_jobs=-1 runs the neighbor searches on all available CPU cores
dbscan = DBSCAN(eps=0.3, min_samples=10, algorithm='kd_tree', n_jobs=-1)
dbscan.fit_predict(Y)
labels = dbscan.labels_
print(labels.size)
clusters = labels.tolist()
# printing each value and its label
for a, b in zip(labels, Y):
    print(a, b)

Most likely your epsilon is too large.

If most points are within epsilon of most other points, then the runtime will be quadratic O(n²). So begin with small values!

You can't just add/remove features and leave epsilon unchanged.
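A common way to pick a starting epsilon is the k-distance plot: sort the distances of every point to its k-th nearest neighbor and look for the elbow of the curve. Below is a minimal sketch of that idea using sklearn's NearestNeighbors; the column name 'amount' and k=10 (matching min_samples from the question) are assumptions taken from the question's code, and the elbow still has to be read off the plot by eye.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 10  # matches min_samples in the question

# 'data' is the DataFrame from the question (assumed)
Y = data['amount'].values.reshape(-1, 1)

nn = NearestNeighbors(n_neighbors=k)
nn.fit(Y)
distances, _ = nn.kneighbors(Y)          # shape (n_samples, k)
k_distances = np.sort(distances[:, -1])  # distance to the k-th neighbor, sorted ascending

# the y-value near the "elbow" of this curve is a reasonable first guess for eps
plt.plot(k_distances)
plt.xlabel('points (sorted)')
plt.ylabel('distance to %d-th nearest neighbor' % k)
plt.title('k-distance plot for choosing eps')
plt.show()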
