
ELKI's LOF implementation for heavily duplicated data

Does ELKI fail for data that has many duplicate values in it? I have files with more than 2 million observations (1D), but they contain only a few hundred unique values; the rest are duplicates. When I run such a file in ELKI, for LOF or LoOP calculations, it returns NaN as the outlier score for any k less than the number of occurrences of the most frequent value. I can imagine the LRD calculation must be causing this problem if duplicates are taken as nearest neighbours. But shouldn't it avoid doing this? Can we rely on the results ELKI produces in such cases?

It is not so much a matter of ELKI, but of the algorithms.

Most outlier detection algorithms use the k nearest neighbors. If these neighbors are exact duplicates of the query point, all the distances involved are zero, which is problematic. In LOF, duplicated points can obtain an outlier score of infinity. Similarly, the outlier scores of LoOP probably become NaN due to a division by 0 if there are too many duplicates.
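
To make the division by zero concrete, here is a minimal numeric sketch (plain Java with made-up values, not ELKI code) of the lrd step when all k nearest neighbors are exact duplicates:

    // Sketch of LOF's local reachability density (lrd) for a point whose
    // k nearest neighbors are exact duplicates of it (illustrative values).
    public class LofDuplicateDemo {
        public static void main(String[] args) {
            // reach-dist_k(p, o) = max(k-distance(o), dist(p, o));
            // with duplicates both terms are 0, so every reachability distance is 0.
            double[] reachDists = {0.0, 0.0, 0.0}; // k = 3, all neighbors duplicates

            double sum = 0.0;
            for (double rd : reachDists) {
                sum += rd;
            }
            // lrd(p) = 1 / mean(reach-dist) = k / sum
            double lrd = reachDists.length / sum; // 3 / 0.0 -> Infinity

            // LOF(p) = mean(lrd of neighbors) / lrd(p); inside a duplicate
            // cluster both sides are Infinity, and Infinity / Infinity = NaN.
            double lof = lrd / lrd;

            System.out.println("lrd = " + lrd); // Infinity
            System.out.println("LOF = " + lof); // NaN
        }
    }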

But that is not a matter of ELKI, but of the definition of these methods. Any implementation that sticks to these definitions should exhibit the same effects. There are some ways to avoid or reduce the effects:

  • add jitter to the data set (see the sketch after this list)
  • remove duplicates (but never consider highly duplicated values outliers!)
  • increase the neighborhood size
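
For the jitter option, a hypothetical pre-processing step could look like the following (the values and the jitter scale are made up; pick a scale well below the resolution of your data):

    import java.util.Random;

    // Illustrative pre-processing: add tiny Gaussian jitter so that exact
    // duplicates become distinct before running LOF/LoOP.
    public class JitterDemo {
        public static void main(String[] args) {
            double[] data = {1.0, 1.0, 1.0, 2.0, 5.0}; // heavily duplicated 1D data
            Random rng = new Random(42L); // fixed seed for reproducibility
            double scale = 1e-6; // jitter magnitude (assumption: far below data resolution)

            for (int i = 0; i < data.length; i++) {
                data[i] += rng.nextGaussian() * scale;
            }
            // data now contains no exact ties, so reachability distances stay > 0
            for (double v : data) {
                System.out.println(v);
            }
        }
    }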

It is easy to prove that such results do arise in the LOF/LoOP equations if the data has duplicates.
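
The same exercise for LoOP (again a plain-Java sketch with made-up values, not ELKI's implementation): the probabilistic set distance pdist is a scaled quadratic mean of the neighbor distances, so duplicates drive it to 0, and the PLOF ratio becomes 0/0:

    // Sketch of LoOP's probabilistic set distance (pdist) and PLOF for a
    // point inside a cluster of exact duplicates (illustrative values).
    public class LoopDuplicateDemo {
        public static void main(String[] args) {
            double lambda = 3.0; // LoOP's significance parameter
            double[] neighborDists = {0.0, 0.0, 0.0}; // all neighbors are duplicates

            double sumSq = 0.0;
            for (double d : neighborDists) {
                sumSq += d * d;
            }
            // pdist(lambda, p, S) = lambda * sqrt(mean of squared distances)
            double pdist = lambda * Math.sqrt(sumSq / neighborDists.length); // 0.0

            // Inside a duplicate cluster, the expected pdist of the neighbors is 0 too.
            double expectedPdist = 0.0;
            double plof = pdist / expectedPdist - 1.0; // 0.0 / 0.0 -> NaN

            System.out.println("pdist = " + pdist); // 0.0
            System.out.println("PLOF  = " + plof);  // NaN
        }
    }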

This limitation of these algorithms can most probably be "fixed", but we want the implementations in ELKI to stay close to the original publications, so we avoid making unpublished changes. But if a "LOFdup" method is published and contributed to ELKI, we would obviously add it.

Note that neither LOF nor LoOP is meant to be used with 1-dimensional data. For 1-dimensional data, I recommend focusing on "traditional" statistical literature instead, such as kernel density estimation. 1-dimensional numerical data is special because it is ordered - this allows for both optimizations and much more advanced statistics that would be infeasible, or would require too many observations, on multivariate data.

LOF and similar methods are very basic statistics - so basic that many statisticians would outright reject them as "stupid" or "naive" - with the key benefit that they easily scale to large, multivariate data sets. Just as naive Bayes often works very well in practice despite its questionable independence assumption, LOF and LoOP contain some questionable decisions too - but they work, and they scale.
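
As an illustration of the 1-dimensional alternative, here is a self-contained sketch of kernel density estimation (Gaussian kernel, hand-picked bandwidth; class name and values are made up). Note that duplicates are harmless here - they simply raise the estimated density:

    // Sketch of 1D outlier scoring via kernel density estimation:
    // points with low estimated density are outlier candidates.
    public class Kde1dDemo {
        // Gaussian-kernel density estimate at x.
        static double kdeAt(double x, double[] data, double bandwidth) {
            double sum = 0.0;
            for (double xi : data) {
                double u = (x - xi) / bandwidth;
                sum += Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
            }
            return sum / (data.length * bandwidth);
        }

        public static void main(String[] args) {
            double[] data = {1.0, 1.0, 1.0, 1.1, 0.9, 5.0}; // duplicates are fine
            double h = 0.5; // bandwidth (assumption; use e.g. Silverman's rule in practice)

            for (double x : data) {
                System.out.printf("x = %.2f  density = %.4f%n", x, kdeAt(x, data, h));
            }
            // The isolated value 5.0 gets a much lower density than the cluster near 1.0.
        }
    }

On 2 million sorted 1D observations with only a few hundred unique values, the density only needs to be evaluated once per unique value - exactly the kind of optimization that ordered data allows.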

In other words, this is not a bug in ELKI. The implementation does what is published.
