
Scikit-Learn DBSCAN clustering yielding no clusters

I have a data set with a dozen dimensions (columns) and about 200 observations (rows). This dataset has been normalized using quantile_transform_normalize. (Edit: I tried running the clustering without normalization, but still no luck, so I don't believe this is the cause.) Now I want to cluster the data into several clusters. Until now I had been using KMeans, but I have read that it may not be accurate in higher dimensions and doesn't handle outliers well, so I wanted to compare to DBSCAN to see if I get a different result.

However, when I try to cluster the data with DBSCAN using the Mahalanobis distance metric, every item is clustered into -1. According to the documentation:

Noisy samples are given the label -1.

I'm not really sure what this means, but I was getting some OK clusters with KMeans, so I know there is something there to cluster -- it's not just random.

Here is the code I am using for clustering:

import numpy as np
import sklearn.cluster

covariance = np.cov(data.values.astype("float32"), rowvar=False)
clusterer = sklearn.cluster.DBSCAN(min_samples=6, metric="mahalanobis", metric_params={"V": covariance})
clusterer.fit(data)

And that's all. I know for certain that data is a numeric Pandas DataFrame, as I have inspected it in the debugger.
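For reference, one way to confirm that every point ended up as noise is to inspect the fitted labels_ attribute; this check is not in the original post, but labels_ is the standard DBSCAN output and -1 is the noise label quoted from the documentation above:

import numpy as np

# labels_ holds one cluster id per row; -1 marks noise
print(np.unique(clusterer.labels_))  # prints [-1] when everything is treated as noise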

What could be causing this issue?

You need to choose the parameter eps, too.

DBSCAN results depend on this parameter very much. You can find some methods for estimating it in the literature.
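One common heuristic from the literature is the k-distance ("elbow") plot: compute each point's distance to its k-th nearest neighbor (with k around min_samples), sort those distances, and pick eps near the knee of the curve. Below is a minimal sketch, assuming the same data, covariance and min_samples as in the question; the use of NearestNeighbors, matplotlib and the placeholder eps value are illustrative and not part of the original answer:

import numpy as np
import matplotlib.pyplot as plt
import sklearn.cluster
from sklearn.neighbors import NearestNeighbors

k = 6  # use the same value as min_samples
covariance = np.cov(data.values.astype("float32"), rowvar=False)

# Distance from each point to its k-th nearest neighbor, under the same metric.
# Note: we query the training data itself, so each point is its own 0-distance neighbor.
nn = NearestNeighbors(n_neighbors=k, algorithm="ball_tree",
                      metric="mahalanobis", metric_params={"V": covariance})
nn.fit(data)
distances, _ = nn.kneighbors(data)
k_distances = np.sort(distances[:, -1])

# Sorted k-distance curve: eps is usually chosen near the "elbow" of this plot
plt.plot(k_distances)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbor")
plt.show()

# Re-run DBSCAN with an explicit eps read off the plot (placeholder value here)
eps = 2.0
clusterer = sklearn.cluster.DBSCAN(eps=eps, min_samples=k,
                                   metric="mahalanobis", metric_params={"V": covariance})
labels = clusterer.fit_predict(data)

With an eps chosen this way, points that have at least min_samples neighbors within eps become core points instead of being labelled -1.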

IMHO, sklearn should not provide a default for this parameter, because it rarely ever works (on normalized toy data it is usually okay, but that's about it).

200 instances probably is too small to reliably measure density, in particular with a dozen variables.
