
ELKI's LOF implementation for heavily duplicated data

Does ELKI fail for data that has many duplicate values in it? I have files with more than 2 million observations (1D), but they contain only a few hundred unique values; the rest are duplicates. When I run such a file in ELKI, for LOF or LoOP calculations, it returns NaN as the outlier score for any k less than the number of occurrences of the most frequent value. I can imagine the LRD calculation must be causing this problem if duplicates are taken as nearest neighbours. But shouldn't it avoid doing this? Can we rely on the results ELKI produces in such cases?

It is not so much a matter of ELKI, but of the algorithms.

Most outlier detection algorithms use the k nearest neighbors. If these neighbors are exact duplicates of the query point, all the distances involved are zero, which is problematic. In LOF, duplicated points can obtain an outlier score of infinity. Similarly, the outlier scores of LoOP probably become NaN due to a division by 0 if there are too many duplicates.
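
To make the division by zero concrete, here is a minimal numeric sketch (plain Java with made-up values, not ELKI code) of the lrd step when all k nearest neighbors are exact duplicates:

    // Sketch of LOF's local reachability density (lrd) for a point whose
    // k nearest neighbors are exact duplicates of it (illustrative values).
    public class LofDuplicateDemo {
        public static void main(String[] args) {
            // reach-dist_k(p, o) = max(k-distance(o), dist(p, o));
            // with duplicates both terms are 0, so every reachability distance is 0.
            double[] reachDists = {0.0, 0.0, 0.0}; // k = 3, all neighbors duplicates

            double sum = 0.0;
            for (double rd : reachDists) {
                sum += rd;
            }
            // lrd(p) = 1 / mean(reach-dist) = k / sum
            double lrd = reachDists.length / sum; // 3 / 0.0 -> Infinity

            // LOF(p) = mean(lrd of neighbors) / lrd(p); inside a duplicate
            // cluster both sides are Infinity, and Infinity / Infinity = NaN.
            double lof = lrd / lrd;

            System.out.println("lrd = " + lrd); // Infinity
            System.out.println("LOF = " + lof); // NaN
        }
    }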

But that is not a matter of ELKI, but of the definition of these methods. Any implementation that sticks to these definitions should exhibit the same effects. There are some ways to avoid or reduce the effects:

  • add jitter to the data set (see the sketch after this list)
  • remove duplicates (but never consider highly duplicated values outliers!)
  • increase the neighborhood size
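
For the jitter option, a hypothetical pre-processing step could look like the following (the values and the jitter scale are made up; pick a scale well below the resolution of your data):

    import java.util.Random;

    // Illustrative pre-processing: add tiny Gaussian jitter so that exact
    // duplicates become distinct before running LOF/LoOP.
    public class JitterDemo {
        public static void main(String[] args) {
            double[] data = {1.0, 1.0, 1.0, 2.0, 5.0}; // heavily duplicated 1D data
            Random rng = new Random(42L); // fixed seed for reproducibility
            double scale = 1e-6; // jitter magnitude (assumption: far below data resolution)

            for (int i = 0; i < data.length; i++) {
                data[i] += rng.nextGaussian() * scale;
            }
            // data now contains no exact ties, so reachability distances stay > 0
            for (double v : data) {
                System.out.println(v);
            }
        }
    }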

It is easy to prove that such results do arise in the LOF/LoOP equations if the data has duplicates.
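
The same exercise for LoOP (again a plain-Java sketch with made-up values, not ELKI's implementation): the probabilistic set distance pdist is a scaled quadratic mean of the neighbor distances, so duplicates drive it to 0, and the PLOF ratio becomes 0/0:

    // Sketch of LoOP's probabilistic set distance (pdist) and PLOF for a
    // point inside a cluster of exact duplicates (illustrative values).
    public class LoopDuplicateDemo {
        public static void main(String[] args) {
            double lambda = 3.0; // LoOP's significance parameter
            double[] neighborDists = {0.0, 0.0, 0.0}; // all neighbors are duplicates

            double sumSq = 0.0;
            for (double d : neighborDists) {
                sumSq += d * d;
            }
            // pdist(lambda, p, S) = lambda * sqrt(mean of squared distances)
            double pdist = lambda * Math.sqrt(sumSq / neighborDists.length); // 0.0

            // Inside a duplicate cluster, the expected pdist of the neighbors is 0 too.
            double expectedPdist = 0.0;
            double plof = pdist / expectedPdist - 1.0; // 0.0 / 0.0 -> NaN

            System.out.println("pdist = " + pdist); // 0.0
            System.out.println("PLOF  = " + plof);  // NaN
        }
    }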

This limitation of these algorithms can most probably be "fixed", but we want the implementations in ELKI to stay close to the original publications, so we avoid making unpublished changes. But if a "LOFdup" method is published and contributed to ELKI, we would obviously add it.

Note that neither LOF nor LoOP is meant to be used with 1-dimensional data. For 1-dimensional data, I recommend focusing on "traditional" statistical literature instead, such as kernel density estimation. 1-dimensional numerical data is special because it is ordered - this allows for both optimizations and much more advanced statistics that would be infeasible, or would require too many observations, on multivariate data.

LOF and similar methods are very basic statistics - so basic that many statisticians would outright reject them as "stupid" or "naive" - with the key benefit that they easily scale to large, multivariate data sets. Just as naive Bayes often works very well in practice despite its questionable independence assumption, LOF and LoOP contain some questionable decisions too - but they work, and they scale.
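
As an illustration of the 1-dimensional alternative, here is a self-contained sketch of kernel density estimation (Gaussian kernel, hand-picked bandwidth; class name and values are made up). Note that duplicates are harmless here - they simply raise the estimated density:

    // Sketch of 1D outlier scoring via kernel density estimation:
    // points with low estimated density are outlier candidates.
    public class Kde1dDemo {
        // Gaussian-kernel density estimate at x.
        static double kdeAt(double x, double[] data, double bandwidth) {
            double sum = 0.0;
            for (double xi : data) {
                double u = (x - xi) / bandwidth;
                sum += Math.exp(-0.5 * u * u) / Math.sqrt(2.0 * Math.PI);
            }
            return sum / (data.length * bandwidth);
        }

        public static void main(String[] args) {
            double[] data = {1.0, 1.0, 1.0, 1.1, 0.9, 5.0}; // duplicates are fine
            double h = 0.5; // bandwidth (assumption; use e.g. Silverman's rule in practice)

            for (double x : data) {
                System.out.printf("x = %.2f  density = %.4f%n", x, kdeAt(x, data, h));
            }
            // The isolated value 5.0 gets a much lower density than the cluster near 1.0.
        }
    }

On 2 million sorted 1D observations with only a few hundred unique values, the density only needs to be evaluated once per unique value - exactly the kind of optimization that ordered data allows.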

In other words, this is not a bug in ELKI. The implementation does what is published.
