Java Math.min/max 性能

Question

编辑：maaartinus 给出了我正在寻找的答案，tmyklebu 关于这个问题的数据帮助很大，所以谢谢两位！ :)

我已经阅读了一些关于 HotSpot 如何在代码中注入一些“内在函数”的内容，特别是针对 Java 标准数学库（来自此处）

所以我决定尝试一下，看看 HotSpot 与直接进行比较有多大区别（特别是因为我听说 min/max 可以编译为无分支 asm）。

public class OpsMath {
    public static final int max(final int a, final int b) {
        if (a > b) {
            return a;
        }
        return b;
    }
}

这就是我的实现。 从另一个 SO 问题我读到使用三元运算符使用额外的寄存器，我没有发现执行 if 块和使用三元运算符（即 return ( a > b ) ? a : b ）之间的显着差异。

分配一个 8Mb 的 int 数组（即 200 万个值），并对其进行随机化，我进行了以下测试：

try ( final Benchmark bench = new Benchmark( "millis to max" ) )
    {
        int max = Integer.MIN_VALUE;

        for ( int i = 0; i < array.length; ++i )
        {
            max = OpsMath.max( max, array[i] );
            // max = Math.max( max, array[i] );
        }
    }

我在 try-with-resources 块中使用 Benchmark 对象。 当它完成时，它对对象调用 close() 并打印块完成所需的时间。 这些测试是通过在上面的代码中注释掉/注释掉最大调用来单独完成的。

'max' 被添加到基准块之外的列表中并稍后打印，以避免 JVM 优化整个块。

每次测试运行时，数组都会随机化。

运行测试 6 次，它给出了以下结果：

Java标准数学：

millis to max 9.242167 
millis to max 2.1566199999999998
millis to max 2.046396 
millis to max 2.048616  
millis to max 2.035761
millis to max 2.001044

在第一次运行后相当稳定，再次运行测试给出了类似的结果。

运算数学：

millis to max 8.65418 
millis to max 1.161559  
millis to max 0.955851 
millis to max 0.946642 
millis to max 0.994543 
millis to max 0.9469069999999999

同样，第一次运行后结果非常稳定。

问题是：为什么？ 那是相当大的不同。 我不知道为什么。 即使我完全像 Math.max() 一样实现我的 max() 方法（即 return (a >= b) ? a : b ）我仍然得到更好的结果！ 这没有道理。

眼镜：

CPU：英特尔 i5 2500，3,3Ghz。 Java 版本：JDK 8（3 月 18 日公开发布），x64。 Debian Jessie（测试版）x64。

我还没有尝试过 32 位 JVM。

编辑：按要求进行自包含测试。 添加了一行以强制 JVM 预加载 Math 和 OpsMath 类。 这消除了 OpsMath 测试第一次迭代的 18 毫秒成本。

// Constant nano to millis.
final double TO_MILLIS = 1.0d / 1000000.0d;
// 8Mb alloc.
final int[] array = new int[(8*1024*1024)/4];
// Result and time array.
final ArrayList<Integer> results = new ArrayList<>();
final ArrayList<Double> times = new ArrayList<>();
// Number of tests.
final int itcount = 6;
// Call both Math and OpsMath method so JVM initializes the classes.
System.out.println("initialize classes " + 
OpsMath.max( Math.max( 20.0f, array.length ), array.length / 2.0f ));
    
final Random r = new Random();
for ( int it = 0; it < itcount; ++it )
{
    int max = Integer.MIN_VALUE;
    
    // Randomize the array.
    for ( int i = 0; i < array.length; ++i )
    {
        array[i] = r.nextInt();
    }
    
    final long start = System.nanoTime();
    for ( int i = 0; i < array.length; ++i )
    {
        max = Math.max( array[i], max );
            // OpsMath.max() method implemented as described.
        // max = OpsMath.max( array[i], max );
    }
    // Calc time.
    final double end = (System.nanoTime() - start);
    // Store results.
    times.add( Double.valueOf( end ) );
    results.add( Integer.valueOf(  max ) );
}
// Print everything.
for ( int i = 0; i < itcount; ++i )
{
    System.out.println( "IT" + i + " result: " + results.get( i ) );
    System.out.println( "IT" + i + " millis: " + times.get( i ) * TO_MILLIS );
}

Java Math.max 结果：

IT0 result: 2147477409
IT0 millis: 9.636998
IT1 result: 2147483098
IT1 millis: 1.901314
IT2 result: 2147482877
IT2 millis: 2.095551
IT3 result: 2147483286
IT3 millis: 1.9232859999999998
IT4 result: 2147482828
IT4 millis: 1.9455179999999999
IT5 result: 2147482475
IT5 millis: 1.882047

OpsMath.max 结果：

IT0 result: 2147482689
IT0 millis: 9.003616
IT1 result: 2147483480
IT1 millis: 0.882421
IT2 result: 2147483186
IT2 millis: 1.079143
IT3 result: 2147478560
IT3 millis: 0.8861169999999999
IT4 result: 2147477851
IT4 millis: 0.916383
IT5 result: 2147481983
IT5 millis: 0.873984

总体结果还是一样。 我尝试过只对数组进行一次随机化，并在同一个数组上重复测试，总体上我得到了更快的结果，但是 Java Math.max 和 OpsMath.max 之间的差异是相同的 2 倍。

Answer 1

很难说为什么Math.max比Ops.max慢，但很容易说明为什么这个基准强烈支持分支到条件移动：在第n次迭代中，

Math.max( array[i], max );

不等于max是array[n-1]大于所有先前元素的概率。 显然，这个概率随着n增加而变得越来越低，并且给定

final int[] array = new int[(8*1024*1024)/4];

大多数时候可以忽略不计。 条件移动指令对分支概率不敏感，它总是花费相同的时间来执行。 如果分支很难预测，则条件移动指令比分支预测更快。 另一方面，如果分支能够以高概率被很好地预测，则分支预测会更快。 目前，与分支的最佳和最差情况相比，我不确定条件移动的速度。 ¹

在你的情况下，除了前几个分支之外的所有分支都是相当可预测的。 从大约n == 10开始，使用条件移动是没有意义的，因为可以保证正确预测分支并且可以与其他指令并行执行（我猜你每次迭代只需要一个周期）。

这似乎发生在计算最小值/最大值或进行一些低效排序的算法（良好的分支可预测性意味着每步低熵）。

¹条件移动和预测分支都需要一个周期。 前者的问题在于它需要两个操作数，这需要额外的指令。 最后，当分支单元空闲时，关键路径可能会变长和/或 ALU 饱和。 通常，但并非总是如此，在实际应用中可以很好地预测分支； 这就是为什么首先发明了分支预测。

至于时序条件移动与分支预测最佳和最坏情况的血腥细节，请参阅下面评论中的讨论。 我自己的基准测试表明，当分支预测遇到最坏的情况时，条件移动明显快于分支预测，但我不能忽略相互矛盾的结果。 我们需要对究竟是什么造成差异进行一些解释。 更多的基准和/或分析可能会有所帮助。

Answer 2

当我在旧的 (1.6.0_27) JVM 上使用Math.max运行您的（适当修改的）代码时，热循环如下所示：

0x00007f4b65425c50: mov    %r11d,%edi         ;*getstatic array
                                              ; - foo146::bench@81 (line 40)
0x00007f4b65425c53: mov    0x10(%rax,%rdx,4),%r8d
0x00007f4b65425c58: mov    0x14(%rax,%rdx,4),%r10d
0x00007f4b65425c5d: mov    0x18(%rax,%rdx,4),%ecx
0x00007f4b65425c61: mov    0x2c(%rax,%rdx,4),%r11d
0x00007f4b65425c66: mov    0x28(%rax,%rdx,4),%r9d
0x00007f4b65425c6b: mov    0x24(%rax,%rdx,4),%ebx
0x00007f4b65425c6f: rex mov    0x20(%rax,%rdx,4),%esi
0x00007f4b65425c74: mov    0x1c(%rax,%rdx,4),%r14d  ;*iaload
                                              ; - foo146::bench@86 (line 40)
0x00007f4b65425c79: cmp    %edi,%r8d
0x00007f4b65425c7c: cmovl  %edi,%r8d
0x00007f4b65425c80: cmp    %r8d,%r10d
0x00007f4b65425c83: cmovl  %r8d,%r10d
0x00007f4b65425c87: cmp    %r10d,%ecx
0x00007f4b65425c8a: cmovl  %r10d,%ecx
0x00007f4b65425c8e: cmp    %ecx,%r14d
0x00007f4b65425c91: cmovl  %ecx,%r14d
0x00007f4b65425c95: cmp    %r14d,%esi
0x00007f4b65425c98: cmovl  %r14d,%esi
0x00007f4b65425c9c: cmp    %esi,%ebx
0x00007f4b65425c9e: cmovl  %esi,%ebx
0x00007f4b65425ca1: cmp    %ebx,%r9d
0x00007f4b65425ca4: cmovl  %ebx,%r9d
0x00007f4b65425ca8: cmp    %r9d,%r11d
0x00007f4b65425cab: cmovl  %r9d,%r11d         ;*invokestatic max
                                              ; - foo146::bench@88 (line 40)
0x00007f4b65425caf: add    $0x8,%edx          ;*iinc
                                              ; - foo146::bench@92 (line 39)
0x00007f4b65425cb2: cmp    $0x1ffff9,%edx
0x00007f4b65425cb8: jl     0x00007f4b65425c50

除了奇怪放置的 REX 前缀（不知道那是什么），这里有一个循环展开 8 次，它主要执行您期望的操作——加载、比较和条件移动。 有趣的是，如果将参数的顺序交换为max ，它会在此处输出另一种 8 深cmovl链。 我猜它不知道如何在循环完成后生成要合并的cmovl s 或 8 个单独的cmovl链的 3 深树。

使用显式的OpsMath.max ，它变成了展开 8 次的条件和无条件分支的最大范围。 我不会发布循环； 它不漂亮。 基本上，上面的每个mov/cmp/cmovl都被分解为加载、比较和条件跳转到mov和jmp发生的位置。 有趣的是，如果您将参数的顺序交换为max ，它会输出一个 8 深的cmovle链。 编辑：正如@maaartinus 所指出的那样，在某些机器上，分支最多的分支实际上更快，因为分支预测器对它们起作用，而这些分支是经过良好预测的分支。

我会犹豫从这个基准中得出结论。 你有基准建设问题； 你有很多次运行它比你，你必须因素代码不同，如果你想一次热点最快的代码。 除了包装器代码之外，您并没有衡量您的max有多快，或者 Hotspot 了解您正在尝试做什么，或任何其他有价值的东西。 max两种实现都会导致代码对于任何类型的直接测量来说都太快了，在更大的程序上下文中没有意义。

Answer 3

使用 JDK 8：

java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)

在 Ubuntu 13.10 上

我运行了以下内容：

import java.util.Random;
import java.util.function.BiFunction;

public class MaxPerformance {
  private final BiFunction<Integer, Integer, Integer> max;
  private final int[] array;

  public MaxPerformance(BiFunction<Integer, Integer, Integer> max, int[] array) {
    this.max = max;
    this.array = array;
  }

  public double time() {
    long start = System.nanoTime();

    int m = Integer.MIN_VALUE;
    for (int i = 0; i < array.length; ++i) m = max.apply(m, array[i]);

    m = Integer.MIN_VALUE;
    for (int i = 0; i < array.length; ++i) m = max.apply(array[i], m);

    // total time over number of calls to max
    return ((double) (System.nanoTime() - start)) / (double) array.length / 2.0;
  }

  public double averageTime(int repeats) {
    double cumulativeTime = 0;
    for (int i = 0; i < repeats; i++)
      cumulativeTime += time();
    return (double) cumulativeTime / (double) repeats;
  }

  public static void main(String[] args) {
    int size = 1000000;
    Random random = new Random(123123123L);
    int[] array = new int[size];
    for (int i = 0; i < size; i++) array[i] = random.nextInt();

    double tMath = new MaxPerformance(Math::max, array).averageTime(100);
    double tAlt1 = new MaxPerformance(MaxPerformance::max1, array).averageTime(100);
    double tAlt2 = new MaxPerformance(MaxPerformance::max2, array).averageTime(100);

    System.out.println("Java Math: " + tMath);
    System.out.println("Alt 1:     " + tAlt1);
    System.out.println("Alt 2:     " + tAlt2);
  }

  public static int max1(final int a, final int b) {
    if (a >= b) return a;
    return b;
  }

  public static int max2(final int a, final int b) {
    return (a >= b) ? a : b; // same as JDK implementation
  }
}

我得到了以下结果（每次调用 max 所需的平均纳秒数）：

Java Math: 15.443555810000003
Alt 1:     14.968298919999997
Alt 2:     16.442204045

所以从长远来看，看起来第二个实现是最快的，尽管幅度相对较小。

为了进行更科学的测试，计算每个调用独立于前一个调用的元素对的最大值是有意义的。 这可以通过使用两个随机数组而不是本基准测试中的一个来完成：

import java.util.Random;
import java.util.function.BiFunction;
public class MaxPerformance2 {
  private final BiFunction<Integer, Integer, Integer> max;
  private final int[] array1, array2;

  public MaxPerformance2(BiFunction<Integer, Integer, Integer> max, int[] array1, int[] array2) {
    this.max = max;
    this.array1 = array1;
    this.array2 = array2;
    if (array1.length != array2.length) throw new IllegalArgumentException();
  }

  public double time() {
    long start = System.nanoTime();

    int m = Integer.MIN_VALUE;
    for (int i = 0; i < array1.length; ++i) m = max.apply(array1[i], array2[i]);
    m += m; // to avoid optimizations!

    return ((double) (System.nanoTime() - start)) / (double) array1.length;
  }

  public double averageTime(int repeats) {
    // warm up rounds:
    double tmp = 0;
    for (int i = 0; i < 10; i++) tmp += time();
    tmp *= 2.0;

    double cumulativeTime = 0;
    for (int i = 0; i < repeats; i++)
        cumulativeTime += time();
    return cumulativeTime / (double) repeats;
  }

  public static void main(String[] args) {
    int size = 1000000;
    Random random = new Random(123123123L);
    int[] array1 = new int[size];
    int[] array2 = new int[size];
    for (int i = 0; i < size; i++) {
        array1[i] = random.nextInt();
        array2[i] = random.nextInt();
    }

    double tMath = new MaxPerformance2(Math::max, array1, array2).averageTime(100);
    double tAlt1 = new MaxPerformance2(MaxPerformance2::max1, array1, array2).averageTime(100);
    double tAlt2 = new MaxPerformance2(MaxPerformance2::max2, array1, array2).averageTime(100);

    System.out.println("Java Math: " + tMath);
    System.out.println("Alt 1:     " + tAlt1);
    System.out.println("Alt 2:     " + tAlt2);
  }

  public static int max1(final int a, final int b) {
    if (a >= b) return a;
    return b;
  }

  public static int max2(final int a, final int b) {
    return (a >= b) ? a : b; // same as JDK implementation
  }
}

这给了我：

Java Math: 15.346468170000005
Alt 1:     16.378737519999998
Alt 2:     20.506475350000006

您的测试设置方式会对结果产生巨大影响。 在这种情况下，JDK 版本似乎是最快的。 与之前的情况相比，这次的幅度相对较大。

有人提到了Caliper。 好吧，如果您阅读wiki ，他们对微基准测试说的第一件事就是不要这样做：这是因为一般来说很难获得准确的结果。 我认为这是一个明显的例子。

Java Math.min/max 性能

问题描述

3 个解决方案

解决方案1
12 已采纳 2014-03-31 06:42:04

解决方案2
3 2014-03-31 04:30:50

解决方案3
1 2014-03-31 14:19:13

Java Math.min/max 性能

问题描述

3 个解决方案

解决方案1 12 已采纳 2014-03-31 06:42:04

解决方案2 3 2014-03-31 04:30:50

解决方案3 1 2014-03-31 14:19:13

解决方案1
12 已采纳 2014-03-31 06:42:04

解决方案2
3 2014-03-31 04:30:50

解决方案3
1 2014-03-31 14:19:13