How to define a range of values for the eps parameter of sklearn.cluster.DBSCAN?

I want to use DBSCAN with the metric sklearn.metrics.pairwise.cosine_similarity to cluster points that have cosine similarity close to 1 (i.e. whose vectors (from "the" origin) are parallel or almost parallel).

The issue:

eps is the maximum distance between two samples for them to be considered as in the same neighbourhood by DBSCAN - meaning that if the distance between two points is lower than or equal to eps, these points are considered neighbours;

but

sklearn.metrics.pairwise.cosine_similarity spits out values between -1 and 1, and I want DBSCAN to consider two points to be neighbours if the similarity between them is, say, between 0.75 and 1 - i.e. greater than or equal to 0.75.

I see two possible solutions:

  1. Pass a range of values to the eps parameter of DBSCAN, e.g. eps=[0.75,1]

  2. Pass the value eps=-0.75 to DBSCAN but (somehow) force it to use the negative of the cosine similarities matrix that is spit out by sklearn.metrics.pairwise.cosine_similarity

I do not know how to implement either of these.

Any guidance would be appreciated!

DBSCAN has a metric keyword argument. Docstring:

metric : string, or callable The metric to use when calculating distance between instances in a feature array. If metric is a string or callable, it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter. If metric is "precomputed", X is assumed to be a distance matrix and must be square. X may be a sparse matrix, in which case only "nonzero" elements may be considered neighbors for DBSCAN.

So probably the easiest thing to do is to precompute the matrix of pairwise cosine similarities, transform it so that it fits your bespoke distance criterion (probably something like D = np.abs(np.abs(CD) - 1), where CD is the matrix returned by cosine_similarity), then set metric to precomputed and pass the precomputed distance matrix D in for X, i.e. the data.

For example:

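#!/usr/bin/env python

import numpy as np

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.cluster import DBSCAN

total_samples = 1000
dimensionality = 3
points = np.random.rand(total_samples, dimensionality)

# pairwise cosine similarities (not distances), with values in [-1, 1]
cos_sim = cosine_similarity(points)

# option 1) vectors are close to each other if they are parallel
# (pointing in the same or in opposite directions)
parallel_distance = np.abs(np.abs(cos_sim) - 1)

# option 2) vectors are close to each other if they point in the same direction
same_direction_distance = np.abs(cos_sim - 1)

# choose the matrix that matches the intended notion of closeness;
# with option 2, eps=0.25 corresponds to a cosine similarity of at least 0.75
bespoke_distance = same_direction_distance

results = DBSCAN(metric='precomputed', eps=0.25).fit(bespoke_distance)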
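To make the difference between the two options concrete, here is a minimal sketch on a small, made-up set of 2D vectors (the data and the choice of min_samples=2 are assumptions for illustration, not part of the answer above):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical toy data: two vectors near +x, two near -x, two near +y.
points = np.array([
    [1.0, 0.0], [2.0, 0.1],
    [-1.0, 0.0], [-3.0, 0.1],
    [0.0, 1.0], [0.1, 2.0],
])

cos_sim = cosine_similarity(points)

# Option 1: parallel and antiparallel vectors both count as close.
dist_parallel = np.abs(np.abs(cos_sim) - 1)
# Option 2: only vectors pointing the same way count as close.
dist_same_direction = np.abs(cos_sim - 1)

# min_samples=2 (an assumption) so that a pair of vectors can form a cluster.
labels_parallel = (
    DBSCAN(metric="precomputed", eps=0.25, min_samples=2)
    .fit(dist_parallel)
    .labels_
)
labels_same_direction = (
    DBSCAN(metric="precomputed", eps=0.25, min_samples=2)
    .fit(dist_same_direction)
    .labels_
)

print(labels_parallel)
print(labels_same_direction)
```

With this data, option 1 should put the +x and -x vectors into one cluster, while option 2 should keep them apart, giving three clusters instead of two.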

A) Check out Generalized DBSCAN, which works fine with similarities too. With cosine, sklearn will supposedly be slow anyway.

B) You can trivially use: cosine distance = 1 - cosine similarity. But that may well cause the sklearn implementation to run in O(n²).

C) You supposedly can even pass -cosinesimilarity as a precomputed distance matrix and use -0.75 as eps.

D) Just make a binary distance matrix (in O(n²) memory, though, so slow), where distance = 0 if the cosine similarity is larger than your threshold, and 1 otherwise. Then use DBSCAN with eps=0.5. It is trivial to show that distance < eps if and only if similarity > threshold.
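Options B and D can be sketched as follows. This is only an illustration under stated assumptions: the random data and fixed seed are made up, the threshold of 0.75 is taken from the question, and "at least the threshold" is used rather than "strictly larger" to match what the question asks for. The last variant uses the built-in "cosine" metric string, which computes 1 - cosine similarity internally.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)   # fixed seed, purely for reproducibility
points = rng.random((200, 3))    # hypothetical data
threshold = 0.75                 # similarity threshold from the question

cos_sim = cosine_similarity(points)

# B) cosine distance = 1 - cosine similarity; clip removes rounding noise
#    such as tiny negative values on the diagonal.
cos_dist = np.clip(1.0 - cos_sim, 0.0, 2.0)
labels_b = DBSCAN(metric="precomputed", eps=1.0 - threshold).fit(cos_dist).labels_

# D) binary matrix: 0 where the similarity reaches the threshold, 1 otherwise.
binary_dist = np.where(cos_sim >= threshold, 0.0, 1.0)
labels_d = DBSCAN(metric="precomputed", eps=0.5).fit(binary_dist).labels_

# Same idea as B, but letting sklearn compute the cosine distances itself.
labels_builtin = DBSCAN(metric="cosine", eps=1.0 - threshold).fit(points).labels_

print(np.array_equal(labels_b, labels_d))
```

Both precomputed variants need the full n × n matrix in memory, which is the cost the answer warns about.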

A few options:

  1. dist = np.abs(cos_sim - 1) (the accepted answer here)
  2. dist = np.arccos(cos_sim) / np.pi https://math.stackexchange.com/a/3385463/816178
  3. dist = 1 - (sim + 1) / 2 https://math.stackexchange.com/q/3241174/816178

I've found they all work the same in practice for this application (precomputed distances in hierarchical clustering; I've hit the snag too). As I understand it, #2 is the more mathematically correct approach, since it preserves angular distance.
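A minimal sketch of the three conversions, with the eps value each one would need for the similarity threshold of 0.75 from the question. The function names are made up for illustration, and the clip before arccos is an added safeguard (an assumption, not part of the formulas above) against similarities that drift slightly outside [-1, 1] through rounding.

```python
import numpy as np


def dist_abs(cos_sim):
    """Option 1: |cos_sim - 1|, ranging over [0, 2]."""
    return np.abs(cos_sim - 1.0)


def dist_angular(cos_sim):
    """Option 2: angle between the vectors divided by pi, ranging over [0, 1]."""
    # Clip so that values such as 1.0000000002 do not make arccos return NaN.
    return np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi


def dist_rescaled(cos_sim):
    """Option 3: 1 - (cos_sim + 1) / 2, ranging over [0, 1]."""
    return 1.0 - (cos_sim + 1.0) / 2.0


similarity_threshold = 0.75  # threshold used in the question
for name, convert in [
    ("abs", dist_abs),
    ("angular", dist_angular),
    ("rescaled", dist_rescaled),
]:
    eps = float(convert(np.array(similarity_threshold)))
    print(f"{name}: eps = {eps:.4f}")
```

The three eps values come out as 0.25, roughly 0.23, and 0.125. All three conversions decrease as the similarity increases, so a single similarity threshold maps to a single eps in each case, which may be why they behave alike for threshold-based clustering.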


 