
Scikit-Learn DBSCAN clustering yielding no clusters

I have a data set with a dozen dimensions (columns) and about 200 observations (rows). This dataset has been normalized using quantile_transform_normalize. (Edit: I tried running the clustering without normalization, but still no luck, so I don't believe this is the cause.) Now I want to cluster the data into several clusters. Until now I had been using KMeans, but I have read that it may not be accurate in higher dimensions and doesn't handle outliers well, so I wanted to compare to DBSCAN to see if I get a different result.

However, when I try to cluster the data with DBSCAN using the Mahalanobis distance metric, every item is clustered into -1. According to the documentation:

Noisy samples are given the label -1.

I'm not really sure what this means, but I was getting some OK clusters with KMeans, so I know there is something there to cluster -- it's not just random.

Here is the code I am using for clustering:

import numpy as np
import sklearn.cluster

covariance = np.cov(data.values.astype("float32"), rowvar=False)
clusterer = sklearn.cluster.DBSCAN(min_samples=6, metric="mahalanobis", metric_params={"V": covariance})
clusterer.fit(data)

And that's all. I know for certain that data is a numeric Pandas DataFrame, as I have inspected it in the debugger.
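For reference, one way to confirm that every point ended up as noise is to inspect the fitted labels_ attribute; this check is not in the original post, but labels_ is the standard DBSCAN output and -1 is the noise label quoted from the documentation above:

import numpy as np

# labels_ holds one cluster id per row; -1 marks noise
print(np.unique(clusterer.labels_))  # prints [-1] when everything is treated as noise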

What could be causing this issue?

You need to choose the parameter eps, too.

DBSCAN results depend on this parameter very much. You can find some methods for estimating it in the literature.
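One common heuristic from the literature is the k-distance ("elbow") plot: compute each point's distance to its k-th nearest neighbor (with k around min_samples), sort those distances, and pick eps near the knee of the curve. Below is a minimal sketch, assuming the same data, covariance and min_samples as in the question; the use of NearestNeighbors, matplotlib and the placeholder eps value are illustrative and not part of the original answer:

import numpy as np
import matplotlib.pyplot as plt
import sklearn.cluster
from sklearn.neighbors import NearestNeighbors

k = 6  # use the same value as min_samples
covariance = np.cov(data.values.astype("float32"), rowvar=False)

# Distance from each point to its k-th nearest neighbor, under the same metric.
# Note: we query the training data itself, so each point is its own 0-distance neighbor.
nn = NearestNeighbors(n_neighbors=k, algorithm="ball_tree",
                      metric="mahalanobis", metric_params={"V": covariance})
nn.fit(data)
distances, _ = nn.kneighbors(data)
k_distances = np.sort(distances[:, -1])

# Sorted k-distance curve: eps is usually chosen near the "elbow" of this plot
plt.plot(k_distances)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbor")
plt.show()

# Re-run DBSCAN with an explicit eps read off the plot (placeholder value here)
eps = 2.0
clusterer = sklearn.cluster.DBSCAN(eps=eps, min_samples=k,
                                   metric="mahalanobis", metric_params={"V": covariance})
labels = clusterer.fit_predict(data)

With an eps chosen this way, points that have at least min_samples neighbors within eps become core points instead of being labelled -1.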

IMHO, sklearn should not provide a default for this parameter, because it rarely ever works (on normalized toy data it is usually okay, but that's about it).

200 instances probably is too small to reliably measure density, in particular with a dozen variables.
