
DBSCAN clustering does not finish on 40k rows but works on 10k rows, using Python and sklearn

I am trying to cluster my dataset, which has 700k rows. I took a 40k-row sample and ran DBSCAN clustering on it with Python and sklearn, on a machine with 32 GB of RAM. The algorithm ran all night without finishing, so I stopped the program manually.

With a 10k-row sample, however, it finished.

Is there a limit on the dataset size that DBSCAN can handle?

I used the code below:

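from sklearn.cluster import DBSCAN

clustering = DBSCAN().fit(df)  # defaults: eps=0.5, min_samples=5
pred_y = clustering.labels_    # one cluster label per row, -1 means noise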

and also this version:

clustering = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2).fit(df)
pred_y = clustering.labels_

How can I use DBSCAN clustering on my dataset?

[UPDATED]

How many columns are we talking about here? In my experience with DBSCAN, the number of columns affects performance more than the number of rows. I tend to apply PCA before passing the data to DBSCAN to reduce the dimensionality, which significantly speeds up the clustering.
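As a rough illustration of the PCA-then-DBSCAN idea, here is a minimal sketch. The data is a made-up stand-in (2,000 rows, 50 columns of uniform noise), and `n_components=5`, `eps=0.3` and `min_samples=5` are placeholder values chosen for the example, not recommendations for the asker's data.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

# Hypothetical stand-in for a wide dataset: 2,000 rows, 50 columns.
rng = np.random.default_rng(0)
X = rng.random((2000, 50))

# Project onto the first 5 principal components (arbitrary choice).
X_reduced = PCA(n_components=5, random_state=0).fit_transform(X)

# Cluster the reduced data; eps and min_samples are placeholders.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_reduced)

print(X_reduced.shape)
print(np.unique(labels))
```

Note that eps has to be chosen again after the projection, because distances in the reduced space differ from distances in the original columns.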

Based on the additional info you provided, I put together a simple repro (warning: with the current parameters it will run for a very long time):

import numpy as np
from sklearn.cluster import DBSCAN

# Uniform random samples in [0, 1) with 4 columns, at increasing row counts
data_1k = np.random.random((1000, 4))
data_5k = np.random.random((5000, 4))
data_10k = np.random.random((10000, 4))
data_40k = np.random.random((40000, 4))

# Same parameters as in the question; swap in data_10k to compare run times
clustering = DBSCAN(eps=9.7, min_samples=2, algorithm='ball_tree', metric='minkowski', leaf_size=90, p=2).fit(data_40k)
np.unique(clustering.labels_)

The code above finishes in just a few seconds for the 10k dataset, but for 40k it runs for a very long time, and the cause is the very high eps value. The random values here all lie in [0, 1), so no two 4-column points are more than 2 apart, and eps=9.7 makes every point a neighbor of every other point. Are you absolutely sure that it's the "right" value for your data?
For the example above, simply lowering eps (e.g. to 0.08) brings the run time down to about 3 seconds.
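To make the effect of eps visible without waiting for a long run, the sketch below counts how many neighbors each point has within eps on a smaller sample. It assumes uniform random data like the repro, and `mean_neighbors` is a helper name made up for this example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Smaller sample with the same layout as the repro: values in [0, 1), 4 columns.
rng = np.random.default_rng(0)
sample = rng.random((2000, 4))

nn = NearestNeighbors(algorithm='ball_tree', leaf_size=90).fit(sample)

def mean_neighbors(eps):
    # For every point, collect the indices of all points within distance eps
    # (the point itself is included), then average the neighborhood sizes.
    neighborhoods = nn.radius_neighbors(sample, radius=eps, return_distance=False)
    return np.mean([len(idx) for idx in neighborhoods])

big = mean_neighbors(9.7)     # eps from the question
small = mean_neighbors(0.08)  # the lowered eps
print(big, small)
```

With eps=9.7 every neighborhood contains the whole sample, so the total number of neighbor entries is the square of the row count (about 1.6 billion at 40k rows), while with eps=0.08 each neighborhood stays tiny.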

If eps=9.7 really is the required value for your data, I would consider applying a scaler or a similar transformation to reduce the range of the values (and with it the eps you need).
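A minimal sketch of the scaling idea, assuming `StandardScaler` as the scaler and two made-up blobs with a wide value range as the data. The eps and min_samples values are placeholders for this toy data only.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical data with a wide value range: two blobs in 4 columns.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=5.0, size=(500, 4))
blob_b = rng.normal(loc=200.0, scale=5.0, size=(500, 4))
raw = np.vstack([blob_a, blob_b])

# Rescale every column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(raw)

# After scaling, a much smaller eps separates the two blobs.
labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(scaled)
n_clusters = len(set(labels) - {-1})  # -1 marks noise points
print(n_clusters)
```

Whether `StandardScaler`, `MinMaxScaler` or another transformation fits best depends on the data, and eps needs to be re-tuned on the scaled values either way.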


 