
Java Math.min/max performance

EDIT: maaartinus gave the answer I was looking for and tmyklebu's data on the problem helped a lot, so thanks both! :)

I've read a bit about how HotSpot has some "intrinsics" that it injects into the code, especially for the Java standard Math libs (from here).

So I decided to give it a try, to see how much difference HotSpot could make versus doing the comparison directly (especially since I've heard min/max can compile to branchless asm).

public class OpsMath {
    public static final int max(final int a, final int b) {
        if (a > b) {
            return a;
        }
        return b;
    }
}

That's my implementation. From another SO question I've read that using the ternary operator uses an extra register; I haven't found significant differences between using an if block and using the ternary operator (i.e., return ( a > b ) ? a : b ).
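As an aside, an integer max can also be written branchlessly in pure Java with a sign-mask trick; this is only an illustrative sketch (the class name is mine), and it is not a drop-in replacement for Math.max because it gives a wrong answer when a - b overflows:

```java
public class BranchlessMax {
    // Branchless max: (a - b) >> 31 is all-ones when a < b, zero otherwise,
    // so the masked difference subtracts out exactly when b is larger.
    // Caveat: incorrect when a - b overflows the int range.
    static int max(final int a, final int b) {
        final int diff = a - b;
        return a - (diff & (diff >> 31));
    }

    public static void main(String[] args) {
        System.out.println(max(3, 7));   // 7
        System.out.println(max(-5, -9)); // -5
    }
}
```

Whether this beats a compare-and-branch depends entirely on what the JIT emits for each; it is shown here only to make "branchless" concrete.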

Allocating an 8 MB int array (i.e., 2 million values) and randomizing it, I do the following test:

try ( final Benchmark bench = new Benchmark( "millis to max" ) )
{
    int max = Integer.MIN_VALUE;

    for ( int i = 0; i < array.length; ++i )
    {
        max = OpsMath.max( max, array[i] );
        // max = Math.max( max, array[i] );
    }
}

I'm using a Benchmark object in a try-with-resources block. When it finishes, it calls close() on the object and prints the time the block took to complete. The tests are done separately by commenting the max calls in or out in the code above.
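The question never shows the Benchmark class itself; a minimal sketch of what such a try-with-resources timing helper might look like (hypothetical, reconstructed from the description above) is:

```java
// Hypothetical sketch of the Benchmark helper described above:
// records the start time on construction and prints the elapsed
// milliseconds when close() is called at the end of the try block.
public class Benchmark implements AutoCloseable {
    private final String label;
    private final long start;

    public Benchmark(final String label) {
        this.label = label;
        this.start = System.nanoTime();
    }

    public long elapsedNanos() {
        return System.nanoTime() - start;
    }

    @Override
    public void close() {
        System.out.println(label + " " + elapsedNanos() / 1_000_000.0);
    }
}
```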

'max' is added to a list outside the benchmark block and printed later, to keep the JVM from optimizing the whole block away.

The array is randomized each time the test runs.

Running the test 6 times gives these results:

Java standard Math:

millis to max 9.242167 
millis to max 2.1566199999999998
millis to max 2.046396 
millis to max 2.048616  
millis to max 2.035761
millis to max 2.001044 

So fairly stable after the first run, and running the tests again gives similar results.

OpsMath:

millis to max 8.65418 
millis to max 1.161559  
millis to max 0.955851 
millis to max 0.946642 
millis to max 0.994543 
millis to max 0.9469069999999999 

Again, very stable results after the first run.

The question is: why? That's quite a big difference, and I have no idea why. Even if I implement my max() method exactly like Math.max() (i.e., return (a >= b) ? a : b ) I still get better results! It makes no sense.

Specs:

CPU: Intel i5 2500, 3.3 GHz. Java version: JDK 8 (public March 18 release), x64. Debian Jessie (testing release) x64.

I have yet to try a 32-bit JVM.

EDIT: Self-contained test as requested. I added a line to force the JVM to preload the Math and OpsMath classes. That eliminates the 18 ms cost of the first iteration of the OpsMath test.

// Constant nano to millis.
final double TO_MILLIS = 1.0d / 1000000.0d;
// 8Mb alloc.
final int[] array = new int[(8*1024*1024)/4];
// Result and time array.
final ArrayList<Integer> results = new ArrayList<>();
final ArrayList<Double> times = new ArrayList<>();
// Number of tests.
final int itcount = 6;
// Call both Math and OpsMath method so JVM initializes the classes.
System.out.println("initialize classes " + 
OpsMath.max( Math.max( 20.0f, array.length ), array.length / 2.0f ));
    
final Random r = new Random();
for ( int it = 0; it < itcount; ++it )
{
    int max = Integer.MIN_VALUE;
    
    // Randomize the array.
    for ( int i = 0; i < array.length; ++i )
    {
        array[i] = r.nextInt();
    }
    
    final long start = System.nanoTime();
    for ( int i = 0; i < array.length; ++i )
    {
        max = Math.max( array[i], max );
        // OpsMath.max() method implemented as described.
        // max = OpsMath.max( array[i], max );
    }
    // Calc time.
    final double end = (System.nanoTime() - start);
    // Store results.
    times.add( Double.valueOf( end ) );
    results.add( Integer.valueOf(  max ) );
}
// Print everything.
for ( int i = 0; i < itcount; ++i )
{
    System.out.println( "IT" + i + " result: " + results.get( i ) );
    System.out.println( "IT" + i + " millis: " + times.get( i ) * TO_MILLIS );
}

Java Math.max result:

IT0 result: 2147477409
IT0 millis: 9.636998
IT1 result: 2147483098
IT1 millis: 1.901314
IT2 result: 2147482877
IT2 millis: 2.095551
IT3 result: 2147483286
IT3 millis: 1.9232859999999998
IT4 result: 2147482828
IT4 millis: 1.9455179999999999
IT5 result: 2147482475
IT5 millis: 1.882047

OpsMath.max result:

IT0 result: 2147482689
IT0 millis: 9.003616
IT1 result: 2147483480
IT1 millis: 0.882421
IT2 result: 2147483186
IT2 millis: 1.079143
IT3 result: 2147478560
IT3 millis: 0.8861169999999999
IT4 result: 2147477851
IT4 millis: 0.916383
IT5 result: 2147481983
IT5 millis: 0.873984

Still the same overall results. I've tried randomizing the array only once and repeating the tests over the same array; I get faster results overall, but the same 2x difference between Java Math.max and OpsMath.max.

It's hard to tell why Math.max is slower than Ops.max, but it's easy to tell why this benchmark strongly favors branching over conditional moves: on the n-th iteration, the probability of

Math.max( array[i], max );

being unequal to max is the probability that array[n-1] is bigger than all previous elements. Obviously, this probability gets lower and lower with growing n, and given

final int[] array = new int[(8*1024*1024)/4];

it's rather negligible most of the time. The conditional move instruction is insensitive to the branching probability; it always takes the same amount of time to execute. A conditional move is faster than a branch if the branch is very hard to predict. On the other hand, a branch is faster if it can be predicted well with high probability. Currently, I'm unsure about the speed of a conditional move compared to the best and worst case of branching.1

In your case all but the first few branches are fairly predictable. From about n == 10 onward, there's no point in using conditional moves, as the branch is virtually guaranteed to be predicted correctly and can execute in parallel with other instructions (I guess you need exactly one cycle per iteration).

This seems to happen for algorithms computing a minimum/maximum or doing some inefficient sorting (good branch predictability means low entropy per step).
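This predictability is easy to check empirically: over random data, the running max changes only about ln(n) times (the expected number of "record" values in a random sequence), so the a > b branch is taken almost never and is trivially predicted. A quick sketch (class and method names are my own):

```java
import java.util.Random;

public class BranchPredictability {
    // Counts how often the running max actually changes. On random data
    // this happens only ~ln(n) times, so the "update" branch inside a
    // max() call is almost always not taken and is trivially predicted.
    static int countUpdates(final int[] a) {
        int max = Integer.MIN_VALUE;
        int updates = 0;
        for (final int v : a) {
            if (v > max) {
                max = v;
                updates++;
            }
        }
        return updates;
    }

    public static void main(String[] args) {
        final int n = 1_000_000;
        final int[] a = new int[n];
        final Random r = new Random(42); // fixed seed for reproducibility
        for (int i = 0; i < n; i++) a[i] = r.nextInt();
        // ln(1_000_000) is about 13.8, so expect a number in that ballpark.
        System.out.println("updates: " + countUpdates(a) + " out of " + n);
    }
}
```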


1 Both a conditional move and a predicted branch take one cycle. The problem with the former is that it needs both of its operands, and this takes additional instructions. In the end the critical path may get longer and/or the ALUs saturated while the branching unit is idle. Often, but not always, branches can be predicted well in practical applications; that's why branch prediction was invented in the first place.

As for the gory details of timing conditional move vs. branch prediction best and worst case, see the discussion below in the comments. My own benchmark shows that a conditional move is significantly faster than branch prediction when branch prediction encounters its worst case, but I can't ignore contradictory results. We need some explanation for what exactly makes the difference. Some more benchmarks and/or analysis could help.
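The best/worst case difference can be sketched directly: do the same conditional work over sorted data (branch almost always goes the same way) and over shuffled data (a 50/50 branch, the predictor's worst case). This is only an illustrative sketch with my own names; note that the JIT may well compile this particular pattern branchlessly, in which case the gap disappears, so results vary by JVM and CPU:

```java
import java.util.Arrays;
import java.util.Random;

public class BranchWorstCase {
    // Same work either way: count elements >= 0. The only difference
    // between the two runs is how predictable the branch is.
    static long countNonNegative(final int[] a) {
        long count = 0;
        for (final int v : a) {
            if (v >= 0) count++; // the branch under test
        }
        return count;
    }

    static void time(final String label, final int[] a) {
        final long start = System.nanoTime();
        final long c = countNonNegative(a);
        final double ms = (System.nanoTime() - start) / 1e6;
        System.out.println(label + ": count=" + c + " in " + ms + " ms");
    }

    public static void main(String[] args) {
        final int n = 10_000_000;
        final int[] shuffled = new int[n];
        final Random r = new Random(1);
        for (int i = 0; i < n; i++) shuffled[i] = r.nextInt();
        final int[] sorted = shuffled.clone();
        Arrays.sort(sorted); // predictable: negatives first, then positives

        for (int round = 0; round < 5; round++) { // repeat to warm up the JIT
            time("sorted  ", sorted);
            time("shuffled", shuffled);
        }
    }
}
```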

When I run your (suitably modified) code using Math.max on an old (1.6.0_27) JVM, the hot loop looks like this:

0x00007f4b65425c50: mov    %r11d,%edi         ;*getstatic array
                                              ; - foo146::bench@81 (line 40)
0x00007f4b65425c53: mov    0x10(%rax,%rdx,4),%r8d
0x00007f4b65425c58: mov    0x14(%rax,%rdx,4),%r10d
0x00007f4b65425c5d: mov    0x18(%rax,%rdx,4),%ecx
0x00007f4b65425c61: mov    0x2c(%rax,%rdx,4),%r11d
0x00007f4b65425c66: mov    0x28(%rax,%rdx,4),%r9d
0x00007f4b65425c6b: mov    0x24(%rax,%rdx,4),%ebx
0x00007f4b65425c6f: rex mov    0x20(%rax,%rdx,4),%esi
0x00007f4b65425c74: mov    0x1c(%rax,%rdx,4),%r14d  ;*iaload
                                              ; - foo146::bench@86 (line 40)
0x00007f4b65425c79: cmp    %edi,%r8d
0x00007f4b65425c7c: cmovl  %edi,%r8d
0x00007f4b65425c80: cmp    %r8d,%r10d
0x00007f4b65425c83: cmovl  %r8d,%r10d
0x00007f4b65425c87: cmp    %r10d,%ecx
0x00007f4b65425c8a: cmovl  %r10d,%ecx
0x00007f4b65425c8e: cmp    %ecx,%r14d
0x00007f4b65425c91: cmovl  %ecx,%r14d
0x00007f4b65425c95: cmp    %r14d,%esi
0x00007f4b65425c98: cmovl  %r14d,%esi
0x00007f4b65425c9c: cmp    %esi,%ebx
0x00007f4b65425c9e: cmovl  %esi,%ebx
0x00007f4b65425ca1: cmp    %ebx,%r9d
0x00007f4b65425ca4: cmovl  %ebx,%r9d
0x00007f4b65425ca8: cmp    %r9d,%r11d
0x00007f4b65425cab: cmovl  %r9d,%r11d         ;*invokestatic max
                                              ; - foo146::bench@88 (line 40)
0x00007f4b65425caf: add    $0x8,%edx          ;*iinc
                                              ; - foo146::bench@92 (line 39)
0x00007f4b65425cb2: cmp    $0x1ffff9,%edx
0x00007f4b65425cb8: jl     0x00007f4b65425c50

Apart from the weirdly placed REX prefix (not sure what that's about), here you have a loop that's been unrolled 8 times and does mostly what you'd expect: loads, comparisons, and conditional moves. Interestingly, if you swap the order of the arguments to max, here it outputs the other kind of 8-deep cmovl chain. I guess it doesn't know how to generate a 3-deep tree of cmovls or 8 separate cmovl chains to be merged after the loop is done.

With the explicit OpsMath.max, it turns into a ratsnest of conditional and unconditional branches that's unrolled 8 times. I'm not going to post the loop; it's not pretty. Basically each mov/cmp/cmovl above gets broken into a load, a compare and a conditional jump to where a mov and a jmp happen. Interestingly, if you swap the order of the arguments to max, here it outputs an 8-deep cmovle chain instead. EDIT: As @maaartinus points out, said ratsnest of branches is actually faster on some machines, because the branch predictor works its magic on them and these are well-predicted branches.

I would hesitate to draw conclusions from this benchmark. You have benchmark construction issues: you have to run it a lot more times than you are, and you have to factor your code differently if you want to time HotSpot's fastest code. Beyond the wrapper code, you aren't measuring how fast your max is, or how well HotSpot understands what you're trying to do, or anything else of value here. Both implementations of max will result in code that's entirely too fast for any sort of direct measurement to be meaningful within the context of a larger program.

Using JDK 8:

java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)

On Ubuntu 13.10

I ran the following:

import java.util.Random;
import java.util.function.BiFunction;

public class MaxPerformance {
  private final BiFunction<Integer, Integer, Integer> max;
  private final int[] array;

  public MaxPerformance(BiFunction<Integer, Integer, Integer> max, int[] array) {
    this.max = max;
    this.array = array;
  }

  public double time() {
    long start = System.nanoTime();

    int m = Integer.MIN_VALUE;
    for (int i = 0; i < array.length; ++i) m = max.apply(m, array[i]);

    m = Integer.MIN_VALUE;
    for (int i = 0; i < array.length; ++i) m = max.apply(array[i], m);

    // total time over number of calls to max
    return ((double) (System.nanoTime() - start)) / (double) array.length / 2.0;
  }

  public double averageTime(int repeats) {
    double cumulativeTime = 0;
    for (int i = 0; i < repeats; i++)
      cumulativeTime += time();
    return (double) cumulativeTime / (double) repeats;
  }

  public static void main(String[] args) {
    int size = 1000000;
    Random random = new Random(123123123L);
    int[] array = new int[size];
    for (int i = 0; i < size; i++) array[i] = random.nextInt();

    double tMath = new MaxPerformance(Math::max, array).averageTime(100);
    double tAlt1 = new MaxPerformance(MaxPerformance::max1, array).averageTime(100);
    double tAlt2 = new MaxPerformance(MaxPerformance::max2, array).averageTime(100);

    System.out.println("Java Math: " + tMath);
    System.out.println("Alt 1:     " + tAlt1);
    System.out.println("Alt 2:     " + tAlt2);
  }

  public static int max1(final int a, final int b) {
    if (a >= b) return a;
    return b;
  }

  public static int max2(final int a, final int b) {
    return (a >= b) ? a : b; // same as JDK implementation
  }
}

And I got the following results (average nanoseconds taken for each call to max):

Java Math: 15.443555810000003
Alt 1:     14.968298919999997
Alt 2:     16.442204045

So over a long run it looks like the second implementation is the fastest, although by a relatively small margin.

In order to have a slightly more scientific test, it makes sense to compute the max of pairs of elements where each call is independent of the previous one. This can be done by using two randomized arrays instead of one, as in this benchmark:

import java.util.Random;
import java.util.function.BiFunction;
public class MaxPerformance2 {
  private final BiFunction<Integer, Integer, Integer> max;
  private final int[] array1, array2;

  public MaxPerformance2(BiFunction<Integer, Integer, Integer> max, int[] array1, int[] array2) {
    this.max = max;
    this.array1 = array1;
    this.array2 = array2;
    if (array1.length != array2.length) throw new IllegalArgumentException();
  }

  public double time() {
    long start = System.nanoTime();

    int m = Integer.MIN_VALUE;
    for (int i = 0; i < array1.length; ++i) m = max.apply(array1[i], array2[i]);
    m += m; // to avoid optimizations!

    return ((double) (System.nanoTime() - start)) / (double) array1.length;
  }

  public double averageTime(int repeats) {
    // warm up rounds:
    double tmp = 0;
    for (int i = 0; i < 10; i++) tmp += time();
    tmp *= 2.0;

    double cumulativeTime = 0;
    for (int i = 0; i < repeats; i++)
        cumulativeTime += time();
    return cumulativeTime / (double) repeats;
  }

  public static void main(String[] args) {
    int size = 1000000;
    Random random = new Random(123123123L);
    int[] array1 = new int[size];
    int[] array2 = new int[size];
    for (int i = 0; i < size; i++) {
        array1[i] = random.nextInt();
        array2[i] = random.nextInt();
    }

    double tMath = new MaxPerformance2(Math::max, array1, array2).averageTime(100);
    double tAlt1 = new MaxPerformance2(MaxPerformance2::max1, array1, array2).averageTime(100);
    double tAlt2 = new MaxPerformance2(MaxPerformance2::max2, array1, array2).averageTime(100);

    System.out.println("Java Math: " + tMath);
    System.out.println("Alt 1:     " + tAlt1);
    System.out.println("Alt 2:     " + tAlt2);
  }

  public static int max1(final int a, final int b) {
    if (a >= b) return a;
    return b;
  }

  public static int max2(final int a, final int b) {
    return (a >= b) ? a : b; // same as JDK implementation
  }
}

Which gave me:

Java Math: 15.346468170000005
Alt 1:     16.378737519999998
Alt 2:     20.506475350000006

The way your test is set up makes a huge difference in the results. The JDK version seems to be the fastest in this scenario, this time by a relatively large margin compared to the previous case.

Somebody mentioned Caliper. Well, if you read the wiki, one of the first things they say about micro-benchmarking is not to do it: this is because it's hard to get accurate results in general. I think this is a clear example of that.
