
Clustering with DBSCAN is surprisingly slow

I am experimenting with clustering, and I am surprised by how slow it seems to be. I have generated a random graph with 30 communities, each containing 30 nodes. Two nodes in the same community are connected with probability 90%, and two nodes in different communities are connected with probability 10%. I measure the similarity between two nodes x and y as the Jaccard similarity of their neighbor sets, |N(x) ∩ N(y)| / |N(x) ∪ N(y)|, and cluster on the corresponding distance.

This toy example spends about 15 seconds on the dbscan call alone, and the time grows very rapidly as I increase the number of nodes. Since there are only 900 nodes in total, this seems very slow.

import time

import numpy as np
import networkx as nx
from sklearn.cluster import dbscan

# Define the Jaccard distance, following the example for clustering with
# Levenshtein distance from http://scikit-learn.org/stable/faq.html
def jaccard_distance(x, y):
    return 1 - len(neighbors[x].intersection(neighbors[y])) / len(neighbors[x].union(neighbors[y]))

def jaccard_metric(x, y):
    i, j = int(x[0]), int(y[0])     # sklearn passes 1-element rows of X; recover the node indices
    return jaccard_distance(i, j)

# Simulate a planted partition graph, the simplest community detection benchmark.
num_communities = 30
size_of_communities = 30
print("planted partition")
G = nx.planted_partition_graph(num_communities, size_of_communities, 0.9, 0.1, seed=42)

# Map each node to the set of its neighbors.
neighbors = {n: set(G[n]) for n in G}

print "Made data"

X= np.arange(len(G)).reshape(-1,1)

t = time.time()
db = dbscan(X, metric = jaccard_metric, eps=0.85, min_samples=2)
print db

print "Clustering took ", time.time()-t, "seconds"

How can I make this more scalable to larger numbers of nodes?

Here is a solution that speeds up the DBSCAN call about 1890-fold on my machine:

# the following code should be added to the question's code (it uses G and db)

import igraph

# use igraph to calculate Jaccard distances quickly
# to_edgelist yields (u, v, data) triples; keep only the (u, v) pairs
edges = list(zip(*nx.to_edgelist(G)))
G1 = igraph.Graph(len(G), list(zip(*edges[:2])))
D = 1 - np.array(G1.similarity_jaccard(loops=False))

# DBSCAN is much faster with metric='precomputed'
t = time.time()
db1 = dbscan(D, metric='precomputed', eps=0.85, min_samples=2)
print "clustering took %.5f seconds" %(time.time()-t)

# both runs should find the same core samples and the same labels
assert np.array_equal(db[0], db1[0]) and np.array_equal(db[1], db1[1])

Here is the output:

...
Clustering took  8.41049790382 seconds
clustering took 0.00445 seconds
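
If you would rather avoid the igraph dependency, the same precomputed distance matrix can be built with numpy and networkx alone. The sketch below is my own variant, not part of the original answer; it assumes a simple, unweighted graph with no isolated nodes, so that (A @ A.T)[x, y] counts the common neighbors of x and y and every union of neighbor sets is non-empty:

# build the Jaccard distance matrix from the 0/1 adjacency matrix
A = nx.adjacency_matrix(G)                     # sparse 0/1 adjacency, CSR format
inter = (A @ A.T).toarray()                    # inter[x, y] = |N(x) & N(y)|
deg = np.asarray(A.sum(axis=1)).ravel()        # deg[x] = |N(x)|
union = deg[:, None] + deg[None, :] - inter    # |N(x) | N(y)| by inclusion-exclusion
D2 = 1 - inter / union                         # Jaccard distance; assumes union > 0 everywhere

db2 = dbscan(D2, metric='precomputed', eps=0.85, min_samples=2)

The heavy lifting becomes a single sparse matrix product instead of hundreds of thousands of Python calls. The trade-off is memory: a dense n-by-n distance matrix is fine at this scale but becomes the bottleneck for much larger graphs. (In recent scikit-learn versions the estimator spelling DBSCAN(eps=0.85, min_samples=2, metric='precomputed').fit_predict(D2) does the same job as the dbscan function.)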
