
Apache Commons Math 2.2 Percentile bug?

I am not 100% sure whether this is a bug or I am doing something wrong, but if you give Percentile a large data set consisting of a single repeated value (see code below), the evaluate method takes a very long time. If you give Percentile random values, evaluate takes considerably less time.

As noted below, Median is a subclass of Percentile.

Percentile javadoc

private void testOne() {
  int size = 200000;
  int sameValue = 100;
  List<Double> list = new ArrayList<Double>();

  // Fill the list with 200,000 copies of the same value
  for (int i = 0; i < size; i++) {
    list.add((double) sameValue);
  }

  Median m = new Median();
  m.setData(ArrayUtils.toPrimitive(list.toArray(new Double[0])));

  long start = System.currentTimeMillis();
  System.out.println("Start: " + start);

  double result = m.evaluate();

  System.out.println("Result: " + result);
  System.out.println("Time: " + (System.currentTimeMillis() - start));
}


private void testTwo() {
  int size = 200000;
  List<Double> list = new ArrayList<Double>();

  Random r = new Random();

  // Fill the list with 200,000 random values in [0, 100)
  for (int i = 0; i < size; i++) {
    list.add(r.nextDouble() * 100.0);
  }

  Median m = new Median();
  m.setData(ArrayUtils.toPrimitive(list.toArray(new Double[0])));

  long start = System.currentTimeMillis();
  System.out.println("Start: " + start);

  double result = m.evaluate();

  System.out.println("Result: " + result);
  System.out.println("Time: " + (System.currentTimeMillis() - start));
}

This is a known issue that was introduced between versions 2.0 and 2.1, and it has been fixed in version 3.1.
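If upgrading is an option, the fix ships in the commons-math3 artifact. Note that the package moved from org.apache.commons.math to org.apache.commons.math3, so imports need updating as well. Assuming a Maven build, the dependency would look like:

```xml
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-math3</artifactId>
    <version>3.1.1</version>
</dependency>
```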

Version 2.0 did indeed sort the data, but in 2.1 they switched to a selection algorithm. However, a bug in their implementation of it led to very slow behavior on data with many identical values: the partitioning comparisons used >= and <= instead of > and <.
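To see why the direction of those comparisons matters, here is a minimal sketch (not the actual Commons Math source) of a Hoare-style quickselect in which a flag toggles between the non-strict comparisons (>= / <=, mimicking the buggy behavior) and the strict ones (> / <). On an all-equal array the non-strict scan loops march one pointer across the whole range and leave the pivot at one end, so each pass only strips off a couple of elements and the total comparison count grows quadratically; with strict comparisons the two pointers meet near the middle and the range halves each time.

```java
import java.util.Arrays;

public class PercentileBugDemo {
    static long comparisons;

    // One comparison of a scanned element against the pivot; the flag
    // switches between the fixed strict tests (> / <) and the buggy
    // non-strict ones (>= / <=).
    static boolean scanDown(double x, double pivot, boolean strict) {
        comparisons++;
        return strict ? x > pivot : x >= pivot;
    }

    static boolean scanUp(double x, double pivot, boolean strict) {
        comparisons++;
        return strict ? x < pivot : x <= pivot;
    }

    // Hoare-style partition: pivot taken from a[begin], returns its final index.
    static int partition(double[] a, int begin, int end, boolean strict) {
        final double pivot = a[begin];
        int i = begin + 1;
        int j = end;
        while (i < j) {
            while (i < j && scanDown(a[j], pivot, strict)) { j--; }
            while (i < j && scanUp(a[i], pivot, strict)) { i++; }
            if (i < j) {
                double tmp = a[i]; a[i] = a[j]; a[j] = tmp;
                i++; j--;
            }
        }
        if (a[i] > pivot) { i--; }   // step back if we overshot the pivot's slot
        a[begin] = a[i];
        a[i] = pivot;
        return i;
    }

    // Iterative quickselect: k-th smallest element (0-based), in place.
    static double select(double[] a, int k, boolean strict) {
        int begin = 0, end = a.length - 1;
        while (begin < end) {
            int p = partition(a, begin, end, strict);
            if (p == k) { return a[p]; }
            if (p < k) { begin = p + 1; } else { end = p - 1; }
        }
        return a[begin];
    }

    // Comparisons needed to find the median of 4000 identical values.
    static long countOnEqualData(boolean strict) {
        double[] same = new double[4000];
        Arrays.fill(same, 100.0);
        comparisons = 0;
        select(same, same.length / 2, strict);
        return comparisons;
    }

    public static void main(String[] args) {
        System.out.println("non-strict (buggy): " + countOnEqualData(false) + " comparisons");
        System.out.println("strict (fixed):     " + countOnEqualData(true) + " comparisons");
    }
}
```

Running this, the non-strict variant does on the order of n²/4 comparisons on the all-equal array while the strict variant does roughly n, which matches the slowdown you measured.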

It's well known that some algorithms exhibit slower performance on certain data sets. Performance can sometimes be improved by randomizing the data set before performing the operation.

Since computing a percentile probably involves sorting the data, I'm guessing that your "bug" is not really a defect in the code, but rather the manifestation of one of the slower-performing data sets.
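If the slowdown really were a sorting worst case triggered by input order, shuffling the array before calling setData would be a cheap way to test that theory. A sketch of a Fisher-Yates shuffle for a primitive double[] (note this cannot help in the all-equal case above, since shuffling identical values changes nothing):

```java
import java.util.Random;

public class ShuffleFirst {
    // Fisher-Yates shuffle: in place, O(n), every permutation equally likely.
    public static void shuffle(double[] data, Random rnd) {
        for (int i = data.length - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);   // uniform index in [0, i]
            double tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```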
