
Distance metrics for clustering non-normally distributed data

The dataset I want to cluster consists of ~1000 samples and 10 features, which have different scales and ranges (some negative, some positive, some both). Using scipy.stats.normaltest() I found that none of the features are normally distributed (all p-values < 1e-4, small enough to reject the null hypothesis that the data come from a normal distribution). But all of the distance measures I'm aware of assume normally distributed data (I was using Mahalanobis until I realized how non-uniform the data was). What distance measures would one use in this situation? Or is this a case where one simply has to normalize every feature and hope that doesn't introduce bias?
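A common way to handle the mixed scales, along the lines of the normalization the last sentence asks about, is to standardize each feature before computing distances, so that no single feature dominates. Below is a minimal sketch of that idea; the synthetic skewed matrix, the choice of StandardScaler, and the cluster count are all assumptions made for illustration, not part of the original question:

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# Hypothetical stand-in for the ~1000 x 10 feature matrix described above:
# strongly skewed features with very different scales, some shifted negative.
rng = np.random.default_rng(0)
X = rng.exponential(scale=[1, 10, 100, 0.5, 5, 50, 2, 20, 200, 3], size=(1000, 10))
X[:, :5] -= 25.0

# normaltest rejects normality for each of these skewed features.
stat, pvals = stats.normaltest(X)
print(pvals)  # all far below 1e-4 for data like this

# Standardize each feature to zero mean / unit variance, then cluster.
X_scaled = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)
```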

Why do you think all distances would assume normal data (which, by the way, is not the same as uniform)?

Consider Euclidean distance. In many physical applications this distance makes perfect sense, because it is "as the crow flies". Manhattan distance makes a lot of sense when movement is constrained to two axes that cannot be used at the same time. Both are completely appropriate for non-normally distributed data.
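To make the point concrete, here is a small sketch showing that both metrics are purely geometric quantities with no distributional assumptions behind them; the use of scipy.spatial.distance.cdist is just one convenient way to compute them:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two points in the plane; nothing about these metrics depends on how
# the coordinates are distributed.
a = np.array([[0.0, 0.0]])
b = np.array([[3.0, 4.0]])

print(cdist(a, b, metric="euclidean"))  # [[5.]]  "as the crow flies"
print(cdist(a, b, metric="cityblock"))  # [[7.]]  Manhattan: |3| + |4|
```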
