
How to define a range of values for the eps parameter of sklearn.cluster.DBSCAN?

I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors from the origin are parallel or almost parallel).

The issue:

eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;

but

sklearn.metrics.pairwise.cosine_similarity returns values between -1 and 1, and I want DBSCAN to consider two points to be neighbours if the similarity between them is, say, between 0.75 and 1, i.e. greater than or equal to 0.75.

I see two possible solutions:

  1. pass a range of values to the eps parameter of DBSCAN eg eps=[0.75,1]

  2. pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarity matrix returned by sklearn.metrics.pairwise.cosine_similarity

I do not know how to implement either of these.

Any guidance would be appreciated!

DBSCAN has a metric keyword argument. Docstring:

metric : string, or callable The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.pairwise_distances for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only "nonzero" elements may be considered neighbors for DBSCAN.

So probably the easiest thing to do is to precompute the cosine similarity matrix, transform it into a distance matrix that fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CS) - 1), where CS is your cosine similarity matrix), then set metric to "precomputed" and pass the precomputed distance matrix D in for X, i.e. the data.

For example:

#!/usr/bin/env python

import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)

# cosine_similarity returns similarities in [-1, 1], not distances
cos_sim = cosine_similarity(points)

# option 1) vectors are close to each other if they are parallel
#           (pointing in the same or in opposite directions)
parallel_distance = np.abs(np.abs(cos_sim) - 1)

# option 2) vectors are close to each other if they point in the same direction
same_direction_distance = np.abs(cos_sim - 1)

# pick one of the two options; eps=0.25 corresponds to a similarity of at least 0.75
bespoke_distance = same_direction_distance
results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)

A) check out Generalized DBSCAN which works fine with similarities too. With cosine, sklearn will supposedly be slow anyway.

B) you can trivially use: cosine distance = 1 - cosine similarity. But that may well cause the sklearn implementation to run in O(n²).

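C) you supposedly can even pass -cosine similarity as a precomputed distance matrix and use -0.75 as eps. Note, though, that scikit-learn's DBSCAN requires eps to be positive, so this trick will likely only work with other implementations.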

D) just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 if the cosine similarity is larger than your threshold, and 1 otherwise. Then use DBSCAN with eps=0.5. It is trivial to show that distance < eps if and only if similarity > threshold.
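Options B and D can be sketched roughly as follows. This is only an illustration under my own assumptions: the toy data, the helper noisy_rays, and the threshold of 0.75 are made up for the example, and cosine_distances is used for the precomputed variant because it returns 1 - similarity already clipped to non-negative values.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_distances, cosine_similarity

rng = np.random.default_rng(0)


def noisy_rays(direction, n, noise=0.02):
    """Hypothetical helper: n points scattered along one direction from the origin."""
    jittered = np.asarray(direction, dtype=float) + rng.normal(scale=noise, size=(n, 3))
    lengths = rng.uniform(0.5, 5.0, size=(n, 1))  # magnitude should not matter for cosine
    return jittered * lengths


# Two bundles of almost-parallel vectors pointing in clearly different directions.
points = np.vstack([noisy_rays([1, 0, 0], 20), noisy_rays([0, 1, 0], 20)])

threshold = 0.75          # minimum cosine similarity for two points to be neighbours
eps = 1.0 - threshold     # the same cut-off expressed as a cosine distance

# Option B, variant 1: let DBSCAN compute cosine distances itself.
labels_direct = DBSCAN(eps=eps, metric="cosine").fit_predict(points)

# Option B, variant 2: precompute distance = 1 - similarity and pass it in.
labels_precomputed = DBSCAN(eps=eps, metric="precomputed").fit_predict(
    cosine_distances(points)
)

# Option D: binary matrix, 0 where similarity exceeds the threshold, 1 otherwise.
binary_distance = np.where(cosine_similarity(points) > threshold, 0.0, 1.0)
labels_binary = DBSCAN(eps=0.5, metric="precomputed").fit_predict(binary_distance)

print(labels_direct)
print(labels_precomputed)
print(labels_binary)
```

On well-separated data like this the three variants should agree; on real data, pairs whose similarity sits exactly at the threshold may be treated differently because option D uses a strict comparison.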

A few options:

  1. dist = np.abs(cos_sim - 1) (the accepted answer here)
  2. dist = np.arccos(cos_sim) / np.pi (https://math.stackexchange.com/a/3385463/816178)
  3. dist = 1 - (cos_sim + 1) / 2 (https://math.stackexchange.com/q/3241174/816178)

I've found they all work the same in practice for this application (precomputed distances in hierarchical clustering; I've hit the same snag). As I understand it, #2 is the more mathematically correct approach, since it preserves angular distance.
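One practical difference between the three is the eps that corresponds to a given similarity threshold. A small sketch, assuming a threshold of 0.75 as in the question (the function name to_distances and the dictionary keys are just labels I made up):

```python
import numpy as np


def to_distances(cos_sim):
    """Apply the three transforms above to a cosine-similarity value or array."""
    # Clip first: rounding can push similarities slightly outside [-1, 1],
    # which would make arccos return nan.
    sim = np.clip(np.asarray(cos_sim, dtype=float), -1.0, 1.0)
    return {
        "one_minus": np.abs(sim - 1.0),       # range [0, 2]
        "angular": np.arccos(sim) / np.pi,    # range [0, 1], proportional to the angle
        "rescaled": 1.0 - (sim + 1.0) / 2.0,  # range [0, 1]
    }


threshold = 0.75
eps_per_transform = {
    name: float(value) for name, value in to_distances(threshold).items()
}

for name, eps in eps_per_transform.items():
    print(f"{name}: similarity >= {threshold} corresponds to distance <= {eps:.4f}")
```

All three transforms decrease monotonically as similarity increases, so for a neighbourhood test like DBSCAN's they select the same pairs as long as eps is converted with the same transform as the distances.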
