
ELKI's LOF implementation for heavily duplicated data

Does ELKI fail for data that contains many duplicate values? I have files with more than 2 million observations (1-dimensional), but they contain only a few hundred unique values; the rest are duplicates. When I run such a file in ELKI for LOF or LoOP calculations, it returns NaN as the outlier score for any k smaller than the number of occurrences of the most frequent value. I can imagine that the LRD calculation causes this problem when duplicates are taken as nearest neighbours. But shouldn't it NOT be doing this? Can we rely on the results ELKI produces in such cases?

It is not so much a matter of ELKI, but of the algorithms themselves.

Most outlier detection algorithms use the k nearest neighbors. If these are identical to the query point, the resulting values can be problematic. In LOF, the neighbors of duplicated points can obtain an outlier score of infinity. Similarly, the outlier scores of LoOP can become NaN due to a division by zero if there are too many duplicates.
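The effect is easy to reproduce with a minimal NumPy sketch of the textbook LOF definition (this is an illustrative toy implementation for 1-dimensional data, not ELKI's code):

```python
import numpy as np

def lof_scores(X, k):
    """Plain LOF as defined by Breunig et al. (2000), with no special
    handling of duplicates, so zero distances flow through as-is."""
    n = len(X)
    d = np.abs(X[:, None] - X[None, :])          # 1-D pairwise distances
    d_no_self = d.copy()
    np.fill_diagonal(d_no_self, np.inf)          # exclude each point itself
    nn = np.argsort(d_no_self, axis=1)[:, :k]    # indices of the k nearest neighbors
    kdist = d[np.arange(n), nn[:, -1]]           # k-distance; 0 for duplicated points
    # reachability-dist(a, b) = max(k-distance(b), dist(a, b))
    reach = np.maximum(kdist[nn], d[np.arange(n)[:, None], nn])
    with np.errstate(divide="ignore", invalid="ignore"):
        lrd = 1.0 / reach.mean(axis=1)           # inf when all reach distances are 0
        lof = lrd[nn].mean(axis=1) / lrd         # inf/inf -> NaN for duplicates
    return lof

# three duplicates of 1.0: with k=2, their k-distance is 0
X = np.array([1.0, 1.0, 1.0, 2.0, 3.0, 10.0])
print(lof_scores(X, k=2))
# the duplicates get NaN, their neighbors get inf, the remaining points stay finite
```

With k=2, the three copies of 1.0 have a k-distance of 0, hence an infinite LRD and a NaN score, while their non-duplicated neighbors inherit an infinite score.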

But that is not a matter of ELKI, but of the definition of these methods. Any implementation that sticks to these definitions should exhibit the same effects. There are some ways to avoid or reduce them:

  • add jitter to the data set
  • remove duplicates (but never consider heavily duplicated values outliers!)
  • increase the neighborhood size k
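The jitter workaround, for example, can be as simple as adding tiny random noise before running the detector (a sketch; the factor 1e-6 is an arbitrary choice, the noise just needs to be small relative to the data's spread):

```python
import numpy as np

rng = np.random.default_rng(42)
# heavily duplicated 1-D data, like the question describes
X = np.array([1.0] * 1000 + [2.0] * 500 + [10.0])
# noise far smaller than the data's spread: it breaks ties between
# duplicates while leaving neighbor rankings essentially unchanged
X_jittered = X + rng.normal(0.0, 1e-6 * X.std(), size=X.shape)
print(len(np.unique(X_jittered)))  # all values are now distinct
```

After jittering, every k-distance is strictly positive, so the LRD and the resulting LOF/LoOP scores stay finite.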

It is easy to show from the LOF/LoOP equations that such results do arise when the data contains duplicates.

This limitation of these algorithms could most probably be "fixed", but we want the implementations in ELKI to stay close to the original publications, so we avoid making unpublished changes. If a "LOFdup" method were published and contributed to ELKI, we would obviously add it.

Note that neither LOF nor LoOP is meant to be used with 1-dimensional data. For 1-dimensional data, I recommend focusing on the "traditional" statistical literature instead, such as kernel density estimation. 1-dimensional numerical data is special because it is ordered; this allows both optimizations and much more advanced statistics that would be infeasible, or would require far more observations, on multivariate data. LOF and similar methods are very basic statistics (so basic that many statisticians would outright reject them as "stupid" or "naive"), with the key benefit that they scale easily to large, multivariate data sets. Sometimes naive methods such as naive Bayes work very well in practice; the same holds for LOF and LoOP: there are some questionable decisions in these algorithms, but they work, and they scale. Just as with naive Bayes: the independence assumption is questionable, but naive Bayes classification often works well, and scales very well.
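As a sketch of the kernel-density alternative for 1-dimensional data (a hand-rolled Gaussian KDE for illustration; in practice one might use scipy.stats.gaussian_kde and pick the bandwidth with a rule of thumb or cross-validation):

```python
import numpy as np

def kde_outlier_scores(X, bandwidth):
    """Negative log-density under a Gaussian kernel density estimate.
    A high score means low density, i.e. more outlying. Duplicates are
    harmless here: they simply add probability mass at their location."""
    z = (X[:, None] - X[None, :]) / bandwidth
    density = np.exp(-0.5 * z ** 2).sum(axis=1) / (len(X) * bandwidth * np.sqrt(2 * np.pi))
    return -np.log(density)

X = np.array([1.0] * 5 + [1.2, 0.9, 8.0])
scores = kde_outlier_scores(X, bandwidth=0.5)
print(scores.argmax())  # the isolated point 8.0 receives the highest score
```

Note how the five duplicates of 1.0 cause no division by zero here; they just make the density peak at 1.0 higher, which is exactly the desired behavior.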

In other words, this is not a bug in ELKI. The implementation does what is published.


 