How to find the "odd one out" in a list of numbers

I have an array of numbers [x1, x2, x3, ...] with more than 20 elements, and I'm trying to put together an algorithm that sorts the elements by the "oddness" they have relative to the rest of the list.

I'm defining the "oddness" as the distance from the barycenters, given some threshold T1. The barycenters are where the values tend to concentrate, possibly given some second threshold T2.

Example: [20, 20, 21, 31, 24, 20, 70, 21, 31, 24, 20, 20, 21, 31, 24, 20, 20, 21, 31, 24] and T1=10. The barycenter is about 24 and the only odd one out is 70.

This case is trivial, as the familiar "distance from the mean or median" metric will do, e.g. d(70)=|24-70|=46>10=T1 and d(31)=|24-31|=7<10=T1.

I can't quite figure out how to deal with the more general case of having 2 or more barycenters.

Example 2: [20, 20, 21, 31, 24, 20, 70, 21, 31, 24, 120, 120, 121, 131, 124, 120, 120, 121, 131, 124]. Now there are two barycenters, d1=24 and d2=124, and the only odd one out is still 70.

But the previous metric breaks down. Maybe the hard part is figuring out where the barycenters are.
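To make the failure concrete, here is a minimal sketch of the single-barycenter metric. It uses the median as the center, and the function name `odd_ones_by_median` is only an illustrative choice, not part of the original question.

```python
from statistics import median

def odd_ones_by_median(values, t1):
    """Return the values farther than t1 from the median, most distant first."""
    center = median(values)
    flagged = [v for v in values if abs(v - center) > t1]
    return sorted(flagged, key=lambda v: abs(v - center), reverse=True)

first = [20, 20, 21, 31, 24, 20, 70, 21, 31, 24,
         20, 20, 21, 31, 24, 20, 20, 21, 31, 24]
second = [20, 20, 21, 31, 24, 20, 70, 21, 31, 24,
          120, 120, 121, 131, 124, 120, 120, 121, 131, 124]

print(odd_ones_by_median(first, t1=10))        # only 70 is flagged
print(len(odd_ones_by_median(second, t1=10)))  # every element is flagged
```

With two clusters the median lands in the empty space between them, so every element ends up farther than T1 from it and the metric no longer separates 70 from the rest.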

Note: I'm looking for a fast algorithm rather than an accurate one

It sounds like the general problem you're trying to solve is this: draw as few radius-R circles as possible such that all inputs are covered by at least one circle; then, find circles containing fewer than k inputs.

In your first case, you draw two radius-10 circles: the first contains all inputs except 70, the second contains just 70. Your criterion for detecting abnormal circles should catch the 70-containing one, which should be simple. In your second case, you draw three radius-10 circles. Again, the criterion that catches the one with 70 only should be easy to state.
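In one dimension a radius-R circle is just an interval of width 2R, and a cover with as few intervals as possible can be built greedily after sorting by starting a new interval at the leftmost value not yet covered. The sketch below follows that idea; the function names are illustrative, `radius` plays the role of R, and `k` is the minimum population from the description above.

```python
def cover_with_intervals(values, radius):
    """Greedy left-to-right cover of 1-D points by closed intervals of width 2*radius.

    Returns a list of (lo, hi, members) tuples in increasing order.
    """
    intervals = []
    for v in sorted(values):
        if intervals and v <= intervals[-1][1]:
            # v still fits in the most recent interval.
            intervals[-1][2].append(v)
        else:
            # v is the leftmost uncovered point: start a new interval there.
            intervals.append((v, v + 2 * radius, [v]))
    return intervals

def sparse_members(values, radius, k):
    """Values that fall in intervals holding fewer than k points."""
    return [v
            for _lo, _hi, members in cover_with_intervals(values, radius)
            if len(members) < k
            for v in members]

first = [20, 20, 21, 31, 24, 20, 70, 21, 31, 24,
         20, 20, 21, 31, 24, 20, 20, 21, 31, 24]
second = [20, 20, 21, 31, 24, 20, 70, 21, 31, 24,
          120, 120, 121, 131, 124, 120, 120, 121, 131, 124]

print(len(cover_with_intervals(first, 10)), sparse_members(first, 10, k=2))
print(len(cover_with_intervals(second, 10)), sparse_members(second, 10, k=2))
```

On the two example lists this produces two and three intervals respectively, and in both cases the only interval with fewer than k=2 members is the one holding 70.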

If I were going to do this from scratch without looking up what the problem is called (it's probably a well-known problem with good, well-known solutions), I'd start by sorting the inputs, which helps a lot since this is a 1D problem. Next, I'd run a sliding window of width 2R over the sorted inputs and compute the moving frequency (the number of inputs inside the window) at each potential barycenter, skipping duplicates and jumping over gaps, and save this frequency series separately. Then I'd greedily place windows at the locations with the highest frequencies first, overlapping as little as possible, until every input is covered. Finally, I'd flag any input covered by a window whose moving frequency falls below some cutoff related to the average moving frequency of the chosen windows; for instance, treat as anomalous every input covered by a window that holds half as many inputs as the average chosen window, or fewer.
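A rough Python sketch of these steps follows. It makes a few assumptions that the description leaves open: candidate windows are centered on each distinct input value, ties between equally dense windows go to the smaller center, and a window is skipped when it covers nothing new. The name `window_outliers` and the `ratio` parameter (0.5 for the "half the average" rule) are illustrative.

```python
from bisect import bisect_left, bisect_right

def window_outliers(values, radius, ratio=0.5):
    """Flag inputs that are covered only by sparsely populated windows."""
    s = sorted(values)
    if not s:
        return []

    def span(center):
        # Index range [lo, hi) of sorted inputs inside [center - radius, center + radius].
        return bisect_left(s, center - radius), bisect_right(s, center + radius)

    # Moving frequency of each candidate window (assumed: one per distinct value).
    freq = {}
    for c in set(s):
        lo, hi = span(c)
        freq[c] = hi - lo

    covered = [False] * len(s)
    chosen = []  # (frequency, indices newly covered by this window)
    # Densest candidates first; ties broken by the smaller center.
    for c in sorted(freq, key=lambda c: (-freq[c], c)):
        lo, hi = span(c)
        new = [i for i in range(lo, hi) if not covered[i]]
        if new:  # skip windows that add no coverage
            for i in new:
                covered[i] = True
            chosen.append((freq[c], new))

    # Cutoff relative to the average frequency of the chosen windows.
    cutoff = ratio * sum(f for f, _ in chosen) / len(chosen)
    return [s[i] for f, new in chosen if f <= cutoff for i in new]

first = [20, 20, 21, 31, 24, 20, 70, 21, 31, 24,
         20, 20, 21, 31, 24, 20, 20, 21, 31, 24]
second = [20, 20, 21, 31, 24, 20, 70, 21, 31, 24,
          120, 120, 121, 131, 124, 120, 120, 121, 131, 124]

print(window_outliers(first, radius=10))
print(window_outliers(second, radius=10))
```

Sorting dominates the cost, and each window count is two binary searches, which suits the "fast rather than accurate" requirement. One limitation of this simple version is that a dense window placed late to pick up a single leftover input keeps its high frequency, so that leftover would not be flagged.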

Example 1 (R = 10):

INPUT:  20, 20, 21, 31, 24, 20, 70, 21, 31, 24, 20, 20, 21, 31, 24, 20, 20, 21, 31, 24

SORTED: 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 24, 24, 24, 24, 31, 31, 31, 31, 70

WINDOW MOVING FREQUENCY:
20: 15
21: 19
(detects gap, jumps)
60: 1
(detects gap, jumps, ends)

WINDOW #1: [11,31]: 19
WINDOW #2: [50, 70]: 1

AVERAGE: 10
50% AVERAGE: 5
WINDOW #1 OVER CUTOFF
WINDOW #2 UNDER CUTOFF

Example 2 (R = 10):

INPUT:  20, 20, 21, 31, 24, 20, 70, 21, 31, 24, 120, 120, 121, 131, 124, 120, 120, 121, 131, 124

SORTED: 20, 20, 20, 21, 21, 24, 24, 31, 31, 70, 120, 120, 120, 120, 121, 121, 124, 124, 131, 131

WINDOW MOVING FREQUENCY:
20: 7
21: 9
(detects gap, jumps)
60: 1
(detects gap, jumps)
110: 4
111: 6
(detects gap, jumps)
114: 8
(detects gap, jumps)
121: 10

WINDOW #1: [111, 131]: 10
WINDOW #2: [11, 31]: 9
WINDOW #3: [50, 70]: 1

AVERAGE: 6.67
50% AVERAGE: 3.33

WINDOW #1 ABOVE CUTOFF
WINDOW #2 ABOVE CUTOFF
WINDOW #3 BELOW CUTOFF
