Scikit-Learn DBSCAN clustering yielding no clusters
I have a data set with a dozen dimensions (columns) and about 200 observations (rows). This dataset has been normalized using quantile_transform_normalize. (Edit: I tried running the clustering without normalization, but still no luck, so I don't believe this is the cause.) Now I want to cluster the data into several clusters. Until now I had been using KMeans, but I have read that it may not be accurate in higher dimensions and doesn't handle outliers well, so I wanted to compare with DBSCAN to see if I get a different result.
However, when I try to cluster the data with DBSCAN using the Mahalanobis distance metric, every item is clustered into -1. According to the documentation:

Noisy samples are given the label -1.
I'm not really sure what this means, but I was getting some OK clusters with KMeans, so I know there is something there to cluster; it's not just random.

Here is the code I am using for clustering:
import numpy as np
import sklearn.cluster

covariance = np.cov(data.values.astype("float32"), rowvar=False)
clusterer = sklearn.cluster.DBSCAN(min_samples=6, metric="mahalanobis", metric_params={"V": covariance})
clusterer.fit(data)
And that's all. I know for certain that data is a numeric Pandas DataFrame, as I have inspected it in the debugger.

What could be causing this issue?
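For reference, counting the fitted labels shows the symptom directly. The sketch below uses random stand-in data of the same shape as my real DataFrame (so the exact numbers are illustrative, not my actual results), with the same parameters as above:

```python
import numpy as np
import pandas as pd
import sklearn.cluster

# Random stand-in data with the same shape as the real dataset (200 x 12).
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 12)))

covariance = np.cov(data.values.astype("float32"), rowvar=False)
clusterer = sklearn.cluster.DBSCAN(min_samples=6, metric="mahalanobis",
                                   metric_params={"V": covariance})
clusterer.fit(data)

# Count how many points fell into each label; -1 means noise.
labels, counts = np.unique(clusterer.labels_, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

With the default eps=0.5, every point on this stand-in data comes back as -1, which matches what I see on my real data.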
You need to choose the parameter eps, too.

DBSCAN results depend on this parameter very much. You can find some methods for estimating it in the literature.
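One such heuristic (my addition, not something this answer spells out) is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbor and place eps near the elbow of the resulting curve. A minimal sketch on random stand-in data:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 12))  # stand-in for the asker's 200 x 12 DataFrame

k = 6  # same as min_samples in the question
# n_neighbors=k + 1 because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(data)
distances, _ = nn.kneighbors(data)
k_dist = np.sort(distances[:, -1])  # distance to the k-th other point, ascending

# Plot k_dist (e.g. with matplotlib) and read eps off the elbow; a crude
# programmatic stand-in is a high quantile of the curve:
eps_guess = float(np.quantile(k_dist, 0.90))
```

A value read off this curve, passed as DBSCAN(eps=eps_guess, ...), should at least stop DBSCAN from labelling everything as noise, though the caveat below about only 200 points in a dozen dimensions still applies.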
IMHO, sklearn should not provide a default for this parameter, because it rarely ever works (on normalized toy data it is usually okay, but that's about it).

200 instances probably is too small to reliably measure density, in particular with a dozen variables.