Finding median of a set of circular data

I would like to write a C++ function which finds the median of an array of circular data. For example, consider readings from a compass, where the readings are assumed to be in [0,360). Though 1 and 359 appear to be far apart, they are very close due to the circular nature of the readings.

Finding the median of N elements of ordinary data works as follows:

1. Sort the N elements (ascending or descending order).
2. If N is odd, the median is the (N+1)/2-th element of the sorted array.
3. If N is even, the median is the average of the N/2-th and (N/2+1)-th elements of the sorted array.
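For reference, the three steps for ordinary data can be sketched in C++ as below. The function name linearMedian is made up for illustration. Note that it gives 180 for the readings 1 and 359, which is exactly the wrap-around problem this question is about.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of ordinary (non-circular) data.
// The vector is taken by value so the caller's data is not reordered.
double linearMedian(std::vector<double> values)
{
    assert(!values.empty());
    std::sort(values.begin(), values.end());
    const std::size_t n = values.size();
    if (n % 2 == 1)
        return values[n / 2];                          // the (N+1)/2-th element, counting from 1
    return (values[n / 2 - 1] + values[n / 2]) / 2.0;  // average of the N/2-th and (N/2+1)-th
}
```

Calling linearMedian({1.0, 359.0}) returns 180.0, although both readings point almost due north.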

However, the wrap-around in circular data takes the problem to a different dimension and makes the solution non-trivial.

A similar question, about finding the mean of circular data, is answered in "How do you calculate the average of a set of circular data?". The suggestion there is to find the unit vector corresponding to each angle and average those vectors. However, the median requires sorting the data, and sorting vectors doesn't make sense in this context. Hence I don't think we can use the proposed scheme to find the median!

For definition and discussion of circular median see

NI Fisher's 'Statistical Analysis of Circular Data', Cambridge Univ. Press 1993

and the discussion surrounding equations 2.32 and 2.33. For multi-modal or isotropic data a unique median may not exist.

Find an axis that divides the data into 2 equal groups and choose the end of the axis at the smaller value of the angle. If the sample size is odd the median will be a data point, otherwise it will be the midpoint of 2 data points.

There are packages in other languages (eg R, MatLab) that would help provide test values for any function you write.

eg https://www.rdocumentation.org/packages/circular/versions/0.4-93

See in particular median.circular and medianHL.circular

or

Berens, Philipp. 'CircStat: A MATLAB Toolbox for Circular Statistics'. Journal of Statistical Software 31, no. 1 (23 September 2009): 1–21. https://doi.org/10.18637/jss.v031.i10 .

and see circ_median

With your vector of angular data points (i.e. a vector of numbers from 0 to 359), create two new vectors; I'll call them x and y. These two new vectors are the cosine and sine, respectively, of your angular data points.

That is, x[n] = cos(data[n]) and y[n] = sin(data[n]), where data is your angular data vector and n indexes the data points. (Convert degrees to radians first if your cos and sin expect radians.)

Next, add up all the values in the x vector to get a single value, call it sum_x, and add up all the values in the y vector to get another single value, call it sum_y.

Now take the inverse tangent of these sums to get a new value. Prefer atan2(sum_y, sum_x) over atan(sum_y/sum_x): atan2 returns the angle in the correct quadrant and copes with sum_x = 0 without a division by zero. This value is very meaningful: it tells you which direction your data is "pointing", i.e. where the majority of your data lies. NOTE: If you do use plain atan(sum_y/sum_x), you must guard against dividing by 0 (when sum_x = 0) and against the indeterminate form (when both sum_x = 0 and sum_y = 0). When sum_x = 0 but sum_y != 0, the result is effectively atan(inf) or atan(-inf), both of which are known (90 and -90 degrees). The indeterminate form just means your data is evenly balanced around the circle, in which case the median is meaningless.

EDIT:

My previous answer needed some tweaking after this point.

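From here, it is easy. Take the angle you got in the previous step, convert it to degrees if your math library works in radians, and add 180 degrees (wrapping modulo 360). This is your reference point of where your data starts and ends: it lies opposite the direction the data is pointing, so cutting the circle there is least likely to split a cluster. From here, you can sort your angular data with this reference point as both the starting and ending point, and find the median of that data.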
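A minimal C++ sketch of this cut-and-sort idea, assuming readings in degrees in [0,360). The function name medianByCutting and the 1e-9 tolerance for the "no preferred direction" case are my own choices, not part of the answer above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Cuts the circle opposite the mean direction, then takes an ordinary median.
// Returns NaN when sum_x and sum_y are both (nearly) zero.
double medianByCutting(const std::vector<double>& degrees)
{
    assert(!degrees.empty());
    const double degToRad = std::acos(-1.0) / 180.0;

    double sumX = 0.0, sumY = 0.0;
    for (double a : degrees) {
        sumX += std::cos(a * degToRad);
        sumY += std::sin(a * degToRad);
    }
    // Assumed tolerance: treat a tiny resultant as "no preferred direction".
    if (std::hypot(sumX, sumY) < 1e-9 * degrees.size())
        return std::numeric_limits<double>::quiet_NaN();

    // atan2 picks the right quadrant and handles sumX == 0.
    const double meanDirection = std::atan2(sumY, sumX) / degToRad;
    const double cut = meanDirection + 180.0;  // reference point opposite the mean

    // Re-express every reading as an offset in [0, 360) measured from the cut.
    std::vector<double> offsets;
    offsets.reserve(degrees.size());
    for (double a : degrees) {
        double off = std::fmod(a - cut, 360.0);
        if (off < 0.0) off += 360.0;
        offsets.push_back(off);
    }
    std::sort(offsets.begin(), offsets.end());

    const std::size_t n = offsets.size();
    const double medianOffset = (n % 2 == 1)
        ? offsets[n / 2]
        : (offsets[n / 2 - 1] + offsets[n / 2]) / 2.0;

    // Map the median offset back to a compass reading.
    double result = std::fmod(cut + medianOffset, 360.0);
    if (result < 0.0) result += 360.0;
    return result;
}
```

For the readings 350, 10 and 20 this gives 10 (up to rounding), where an ordinary median would give 20.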

Two properties of median allow inventing two distinct algorithms for median finding.

1) Median minimizes sum of absolute distance to all other elements -- O(n^2) algo:

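best_so_far = INT_MAX;   // larger than any possible sum
for (i = 0; i < N; i++)
{
     sum = 0;
     for (j = 0; j < N; j++)
     {
        d = abs(item[i] - item[j]) % 360;
        sum += (d > 180) ? 360 - d : d;   // go the shorter way around the circle
     }
     if (sum < best_so_far) { best_so_far = sum; index = i; }
}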

2) Median satisfies that half of items are less and half are greater

  • sort the items
  • locate the first set of items (i=0...I), satisfying either that I <= N/2, OR item[I] > item[i] + 180
  • if the condition for median is not satisfied, advance either i, or I.
  • requires O(N*log N) for sorting and O(N) for the next scan

Of course, in cyclic data every item (and every point in between data points) can be a proper candidate for the median.
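The first (O(n^2)) approach can be written out as a self-contained C++ function, sketched below. The names arcDistance and arcMedianIndex are illustrative. The distance between two readings is taken as the shorter arc, so it never exceeds 180 degrees, and ties are resolved in favour of the first candidate.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Shorter arc between two angles given in degrees; the result is in [0, 180].
double arcDistance(double a, double b)
{
    const double d = std::fmod(std::fabs(a - b), 360.0);
    return d > 180.0 ? 360.0 - d : d;
}

// Index of the data point with the smallest sum of arc distances to all points.
std::size_t arcMedianIndex(const std::vector<double>& degrees)
{
    assert(!degrees.empty());
    std::size_t best = 0;
    double bestSum = std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < degrees.size(); ++i) {
        double sum = 0.0;
        for (std::size_t j = 0; j < degrees.size(); ++j)
            sum += arcDistance(degrees[i], degrees[j]);
        if (sum < bestSum) {  // strict comparison keeps the first of several ties
            bestSum = sum;
            best = i;
        }
    }
    return best;
}
```

For the readings 350, 10 and 20 the sums are 50, 30 and 40, so the function returns index 1 (the reading 10).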

I've actually given this topic way more thought than is healthy so I'll share my thoughts and findings here. Maybe someone will have a similar problem and find this useful.

I haven't used C++ in many years so please forgive me if I write all the code in C#. I believe a fluent C++ speaker can pretty easily translate the algorithms.

Circular mean

First, let's define the circular mean. It's calculated by converting your points to radians, where your period (256, 360 or whatever; the value that is interpreted to be the same as zero) is scaled to 2*pi. You then calculate the sine and cosine of those radian values. Those are the y and x coordinates of your values on a unit circle. You then sum up all the sines and all the cosines and take atan2 of the two sums. This gives you the average angle, which can easily be converted back to your data's scale by dividing by the scaling factor.

var scalingFactor = 2 * Math.PI / period;

var sines = 0.0;
var cosines = 0.0;
foreach (var value in inputs)
{
    var radians = value * scalingFactor;
    sines += Math.Sin(radians);
    cosines += Math.Cos(radians);
}

var circularMean = Math.Atan2(sines, cosines) / scalingFactor;

if (circularMean >= 0)
    return circularMean;
else
    return circularMean + period;
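Since the question asks for C++, here is a rough translation of the snippet above. The function signature is my own framing (the C# fragment does not show one), and pi is obtained from acos(-1) because M_PI is not part of standard C++.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Circular mean of values that wrap around at `period` (e.g. 360 or 256).
// The result is normalized to be non-negative.
double circularMean(const std::vector<double>& inputs, double period)
{
    assert(!inputs.empty() && period > 0.0);
    const double scalingFactor = 2.0 * std::acos(-1.0) / period;

    double sines = 0.0;
    double cosines = 0.0;
    for (double value : inputs) {
        const double radians = value * scalingFactor;
        sines += std::sin(radians);
        cosines += std::cos(radians);
    }

    const double mean = std::atan2(sines, cosines) / scalingFactor;
    return mean >= 0.0 ? mean : mean + period;
}
```

Because of rounding, a mean that is mathematically 0 may come back as a value just below the period (for example for the inputs 350 and 10), so compare results with a circular tolerance rather than with ==.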

Marginal circular median

The simplest approach to a circular median is just a modified way of handling the circular mean.

The circular median can be calculated in a similar way, by just finding the median of the sines and cosines instead of the sums, and calculating the atan2 of that. This way, you are finding the marginal median of the circle points and taking its angle as a result.

var scalingFactor = 2 * Math.PI / period;

var sines = new List<double>();
var cosines = new List<double>();
foreach (var value in inputs)
{
    var radians = value * scalingFactor;
    sines.Add(Math.Sin(radians));
    cosines.Add(Math.Cos(radians));
}

var circularMedian = Math.Atan2(Median(sines), Median(cosines)) / scalingFactor;

if (circularMedian >= 0)
    return circularMedian;
else
    return circularMedian + period;

This approach is O(n) (given an O(n) median selection), robust to outliers and very simple to implement. It may suit your purposes well enough, but it has a problem: it is not rotation-invariant. Rotating all the input points by the same angle does not necessarily rotate the result by that angle. Depending on the distribution of your input data, this may or may not be a problem.
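A C++ sketch of the marginal median follows. The helper medianOf stands in for the Median function that the C# snippet calls but does not show, so its behaviour for even counts (averaging the two middle values) is an assumption. It uses std::nth_element to stay O(n) on average.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Ordinary median via selection; averages the two middle values for even sizes.
double medianOf(std::vector<double> v)
{
    assert(!v.empty());
    const std::size_t mid = v.size() / 2;
    std::nth_element(v.begin(), v.begin() + mid, v.end());
    const double upper = v[mid];
    if (v.size() % 2 == 1)
        return upper;
    const double lower = *std::max_element(v.begin(), v.begin() + mid);
    return (lower + upper) / 2.0;
}

// Angle of the point (median of cosines, median of sines), scaled back to the period.
double marginalCircularMedian(const std::vector<double>& inputs, double period)
{
    assert(!inputs.empty() && period > 0.0);
    const double scalingFactor = 2.0 * std::acos(-1.0) / period;

    std::vector<double> sines, cosines;
    for (double value : inputs) {
        const double radians = value * scalingFactor;
        sines.push_back(std::sin(radians));
        cosines.push_back(std::cos(radians));
    }

    const double median = std::atan2(medianOf(sines), medianOf(cosines)) / scalingFactor;
    return median >= 0.0 ? median : median + period;
}
```

The rotation problem is easy to reproduce: the inputs 0, 30, 100 give 30, but the same points rotated by 50 degrees (50, 80, 150) give roughly 77.2 rather than 80, because the median sine and the median cosine now come from different points.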

Circular arc median

To understand this other approach, you need to stop thinking of means and medians in terms of "this is how it's calculated", but in terms of what the resulting values actually represent.

For non-cyclic data, you get the mean by summing up all the values and dividing by the number of elements. What this number represents, though, is the value with the minimal sum of all squared distances to data elements. (I hear statisticians call this value the L2 estimate of location, but a statistician should probably confirm or deny this.)

Likewise for median. You get it by finding the data element that would end up in the middle if all data were sorted (ideally, using an O(n) selection algorithm , like nth_element in C++). What this number is, though, is a value that has the minimal sum of all absolute (non-squared!) distances to data elements. (Supposedly, this value is called an L1 estimate of location.)

Sorting circular data doesn't help you find a middle, so the usual way of thinking about medians doesn't work, but you can still find this point that minimizes the sum of absolute distances from all data points. Here's the algorithm that I came up with, that runs in O(n) time assuming the input data is normalized to >= 0 and < period, and then sorted. (If you need to do this sorting as part of your calculation, then the runtime is O(n log n).)

It works by going through all the data points and keeping track of the sum of distances. When you shift right to the next data point by a distance D, the sum of distances to all the left points increases by D*LeftCount and the sum of distances to all the right points decreases by D*RightCount. Then, if some of the left points are now actually right points, because their left distance is larger than period/2, you subtract their previous distance and add the new, correct distance.

For comparing the current sum to the best sum, I added a bit of tolerance to guard against inexact floating point arithmetic.

There may be multiple or infinitely many points that satisfy the minimum distances condition. With non-circular medians with even number of values, the median can be any value between the two central values. It's usually taken to be the average of those two central values, so I took the similar approach with this median algorithm. I find all data points that minimize the distances and then just calculate the circular mean of those points.

// Requires a sorted list with values normalized to [0,period).

// Doing an initialization pass:
//   * candidate is the lowest number
//   * finding the index where the circle with this candidate starts
//   * calculating the score for this candidate - the sum of absolute distances
//   * counting the number of values to the left of the candidate
int i;
var candidate = list[0];
var distanceSum = 0.0;
for (i = 1; i < list.Count; ++i)
{
    if (list[i] >= candidate + period / 2)
        break;
    distanceSum += list[i] - candidate;
}
var leftCount = list.Count - i;
var circleStart = i;
if (circleStart == list.Count)
    circleStart = 0;
else
    for (; i < list.Count; ++i)
        distanceSum += candidate + period - list[i];

var previousCandidate = candidate;
var bestCandidates = new List<double> { candidate };
var bestDistanceSum = distanceSum;
var equalityTolerance = period * 1e-10;

for (i = 1; i < list.Count; ++i)
{
    candidate = list[i];

    // A formula for correcting the distance given the movement to the right.
    // It doesn't take into account that some values may have wrapped to the other side of the circle.
    ++leftCount;
    distanceSum += (2 * leftCount - list.Count) * (candidate - previousCandidate);

    // Counting all the values that wrapped to the other side of the circle
    // and correcting the sum of distances from the candidate.
    if (i <= circleStart)
        while (list[circleStart] < candidate + period / 2)
        {
            --leftCount;
            distanceSum += 2 * (list[circleStart] - candidate) - period;
            ++circleStart;
            if (circleStart == list.Count)
            {
                circleStart = 0;
                break; // Letting the next loop continue.
            }
        }
    if (i > circleStart)
        while (list[circleStart] < candidate - period / 2)
        {
            --leftCount;
            distanceSum += 2 * (list[circleStart] - candidate) + period;
            ++circleStart;
        }

    // Comparing current sum to the best one, using the given tolerance.
    if (distanceSum <= bestDistanceSum + equalityTolerance)
    {
        if (distanceSum >= bestDistanceSum - equalityTolerance)
        {
            // The numbers are close, so using their average as the next best.
            bestDistanceSum = (bestCandidates.Count * bestDistanceSum + distanceSum) / (bestCandidates.Count + 1);
        }
        else
        {
            // The new number is significantly better, clearing.
            bestDistanceSum = distanceSum;
            bestCandidates.Clear();
        }
        bestCandidates.Add(candidate);
    }

    previousCandidate = candidate;
}

if (bestCandidates.Count == 1)
    return bestCandidates[0];
else
    return CircularMean(bestCandidates, period);

Geometric circular median

There is an inconsistency in the previous algorithm, in the way the median is defined in relation to the circular mean. The circular mean minimizes the sum of squared Euclidean distances between points on a circle. In other words, it looks at the straight lines connecting points on a circle, cutting through the circle.

The arc median, as I calculate it above, looks at the arc distances: how far the points are to each other by moving on the perimeter of the circle, not by taking a straight line between them.

I have thought about how to address this issue, if it bothers you, but I haven't really done any experiments so I can't claim the following method works. In short, I believe you could use a modification of the Iteratively reweighted least squares algorithm (IRLS) , which is what is usually used to calculate geometric medians .

The idea is to pick a starting value (for instance, the circular mean or the arc median presented above), and calculate the Euclidean distance from it to each point: Di = sqrt(dxi^2 + dyi^2). The circular mean minimizes the squares of those distances, so the weight of each point should cancel out the square and bring it back to just D: Wi = Di / Di^2, which is just Wi = 1 / Di.

With these weights, calculate the weighted circular mean (same as the circular mean, but multiply each sine and cosine by the weight of that point before summing them up) and repeat the process. Repeat until enough iterations have passed or until the result stops changing much.

The problem with this algorithm is that it has a division by zero if the current solution falls exactly on a data point. Even if the distance isn't exactly zero, the solution will stop moving if you hit close enough to the point because the weight will become enormous compared to all the other ones. This can be fixed by adding a small fixed offset to the distance before dividing by it. This will make the solution suboptimal, but at least it won't stop on a wrong point.

It will still take some number of iterations to dig itself out of that wrong point unless the offset is relatively large, and the final solution is worse the bigger the offset is. So the best way would probably be to start with a fairly large offset and then progressively make it smaller with each iteration.
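To make the idea concrete, here is an untested-in-practice C++ sketch of that reweighting loop. Everything specific in it is an assumption on my part: the function name, the starting offset of 0.5 (in chord-length units on a unit circle), halving the offset each round, and the fixed count of 50 iterations. I have not verified that it converges to the true minimum for arbitrary data.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Repeated weighted circular mean with weights 1 / (chord distance + offset).
// `start` is the initial guess, in the same units as the inputs.
double geometricCircularMedian(const std::vector<double>& inputs, double period,
                               double start, int iterations = 50)
{
    assert(!inputs.empty() && period > 0.0);
    const double scalingFactor = 2.0 * std::acos(-1.0) / period;

    // Coordinates of every data point on the unit circle.
    std::vector<double> xs, ys;
    for (double value : inputs) {
        xs.push_back(std::cos(value * scalingFactor));
        ys.push_back(std::sin(value * scalingFactor));
    }

    double angle = start * scalingFactor;
    double offset = 0.5;  // assumed initial offset; keeps the weights finite
    for (int k = 0; k < iterations; ++k) {
        const double cx = std::cos(angle);
        const double cy = std::sin(angle);
        double sumX = 0.0, sumY = 0.0;
        for (std::size_t i = 0; i < xs.size(); ++i) {
            const double distance = std::hypot(xs[i] - cx, ys[i] - cy);
            const double weight = 1.0 / (distance + offset);
            sumX += weight * xs[i];
            sumY += weight * ys[i];
        }
        if (sumX == 0.0 && sumY == 0.0)
            break;             // weighted points cancel out: no direction to move to
        angle = std::atan2(sumY, sumX);
        offset *= 0.5;         // assumed schedule: shrink the offset every iteration
    }

    double result = std::fmod(angle / scalingFactor, period);
    if (result < 0.0) result += period;
    return result;
}
```

For symmetric data such as 10, 20, 30 started at 20 it stays at 20, which is at least consistent with what a median should do there.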

It is not possible to canonically extend the concept of median to circular data. For the sake of simplicity, let's consider numbers in [0, 10) and, as an example, the (already ordered) set { 1 3 5 7 8 }. Depending on how you rotate the array, you get different values for the median:

1 3 5 7 8    -> 5
3 5 7 8 1    -> 7
5 7 8 1 3    -> 8
...etc...

and each is as good as the others.

I am not claiming that it is not possible to define a median on circular data. I am just claiming that the "normal" median cannot be extended to that case in a meaningful way without adding additional constraints or making an arbitrary choice.
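The rotation example is easy to reproduce. The small C++ sketch below (the function name is made up) takes the middle element of every rotation of an odd-length sequence, treating each rotation as if it were the "sorted" order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Middle element of each successive left rotation of `values` (odd length only).
std::vector<int> mediansOfRotations(std::vector<int> values)
{
    assert(values.size() % 2 == 1);
    std::vector<int> result;
    for (std::size_t k = 0; k < values.size(); ++k) {
        result.push_back(values[values.size() / 2]);
        std::rotate(values.begin(), values.begin() + 1, values.end());
    }
    return result;
}
```

For { 1 3 5 7 8 } this yields 5, 7, 8, 1, 3: every element is the "median" for some choice of starting point.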
