
DBSCAN: How to Cluster a Large Dataset with One Huge Cluster

I am trying to run DBSCAN on 18 million data points, so far just in 2D but hoping to go up to 6D. I have not been able to find a way to run DBSCAN on that many points. The closest I got was 1 million points with ELKI, and that took an hour. I have used Spark before, but unfortunately it does not have DBSCAN available.

Therefore, my first question: can anyone recommend a way of running DBSCAN on this much data, likely in a distributed fashion?

Next, the nature of my data is that ~85% of it lies in one huge cluster (this is for anomaly detection). The only technique I have come up with that lets me process more data is to replace a big chunk of that huge cluster with a single data point, placed so that it can still reach all of the chunk's neighbours (the deleted chunk is smaller than epsilon).
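Roughly, the reduction I have in mind looks like the following NumPy sketch (the function name, the fixed centre, and the core radius are placeholders for illustration, not my actual pipeline):

```python
import numpy as np

def collapse_dense_core(X, center, core_radius):
    """Replace every point within core_radius of `center` with one representative.

    The idea: if the removed chunk is small relative to epsilon, the single
    representative should still be within reach of the points that bordered
    the chunk, so DBSCAN can still connect them to the big cluster.
    """
    dist = np.linalg.norm(X - center, axis=1)
    in_core = dist <= core_radius
    if not in_core.any():
        return X
    representative = X[in_core].mean(axis=0, keepdims=True)
    return np.vstack([X[~in_core], representative])

# e.g. with eps = 0.3 for DBSCAN, collapse everything within 0.25 of the origin:
# X_reduced = collapse_dense_core(X, center=np.array([0.0, 0.0]), core_radius=0.25)
```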

Can anyone provide any tips on whether I'm doing this right, or whether there is a better way to reduce the complexity of DBSCAN when you know that most of the data is in one cluster centered around (0.0, 0.0)?

  1. Have you added an index to ELKI, and tried the parallel version? Except in the git version, ELKI will not automatically add an index; and even then, fine-tuning the index for the problem can help.

  2. DBSCAN is not a good approach for anomaly detection: noise is not the same as anomalies. I'd rather use a density-based anomaly detection method such as LOF (a sketch follows this list). There are variants that try to skip over "clear inliers" more efficiently if you know you are only interested in the top 10%.

  3. If you already know that most of your data is in one huge cluster, why not model that big cluster directly, and remove it or replace it with a smaller approximation?

  4. Subsample. There is usually next to no benefit to using the entire data set. Even (or especially) if you are interested in the "noise" objects, there is the trivial strategy of randomly splitting your data into, e.g., 32 subsets, clustering each subset, and joining the results back together (see the sketch after this list). These 32 parts can be trivially processed in parallel on separate cores or computers; and because the underlying problem is quadratic in nature, the speedup will be anywhere between 32 and 32*32 = 1024. This holds in particular for DBSCAN: larger data usually means you also want a much larger minPts, but then the results will not differ much from those of a subsample with a smaller minPts.
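For point 2, here is a minimal sketch of density-based outlier detection with scikit-learn's LocalOutlierFactor; the n_neighbors and contamination values are illustrative assumptions, and at 18 million points this would itself need subsampling or an index-backed implementation:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Stand-in for the real data: X should be an (n_samples, n_features) array.
rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 2))

# LOF flags points whose neighbourhood is much sparser than that of their
# neighbours; `contamination` is the expected fraction of outliers.
lof = LocalOutlierFactor(n_neighbors=50, contamination=0.1)
labels = lof.fit_predict(X)          # -1 = outlier, 1 = inlier

anomalies = X[labels == -1]
print(f"{len(anomalies)} candidate anomalies out of {len(X)} points")
```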
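For point 4, here is a sketch of the split-and-cluster strategy using scikit-learn's DBSCAN; eps, min_samples and n_splits are illustrative, and only the per-subset noise points are collected (joining the clusters themselves back together would need an extra merge pass):

```python
import numpy as np
from sklearn.cluster import DBSCAN

def noise_from_random_splits(X, n_splits=32, eps=0.3, min_samples=10, seed=0):
    """Cluster random subsets independently and collect each subset's noise points.

    Each subset holds roughly len(X) / n_splits points, so the expensive part
    of DBSCAN shrinks accordingly; the subsets are independent, so they could
    run in parallel on separate cores or machines.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    noise_idx = []
    for chunk in np.array_split(order, n_splits):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[chunk])
        noise_idx.append(chunk[labels == -1])   # scikit-learn marks noise as -1
    return np.concatenate(noise_idx)

# Indices of points that were noise within their own subset:
# candidate_anomalies = noise_from_random_splits(X, n_splits=32, eps=0.3, min_samples=10)
```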

But by all means: before scaling to larger data, make sure your approach solves your problem and is the smartest way of solving it. Clustering for anomaly detection is like trying to hammer a screw into the wall. It works, but maybe using a nail instead of a screw is the better approach.

Even if you have "big" data and are proud of doing "big data", always begin with a subsample. Unless you can show that result quality increases with data set size, don't bother scaling to big data: the overhead is too high unless you can prove the value.
