Java Math.min/max performance

Question

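EDIT: maaartinus gave the answer I was looking for, and tmyklebu's data on the problem helped a lot, so thanks to both! :)

I've been reading a bit about how HotSpot injects some "intrinsics" into the code, especially for the Java standard Math library (from here).

So I decided to give it a try, to see how much difference HotSpot makes compared with doing the comparison directly (especially since I've heard min/max can compile to branchless asm).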

public class OpsMath {
    public static final int max(final int a, final int b) {
        if (a > b) {
            return a;
        }
        return b;
    }
}

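That's my implementation. From another SO question I read that using the ternary operator uses an extra register, but I haven't found any significant difference between the if block and the ternary operator (i.e. return ( a > b ) ? a : b).

Allocating an 8 MB int array (about 2 million values) and randomizing it, I run the following test: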

try ( final Benchmark bench = new Benchmark( "millis to max" ) )
    {
        int max = Integer.MIN_VALUE;

        for ( int i = 0; i < array.length; ++i )
        {
            max = OpsMath.max( max, array[i] );
            // max = Math.max( max, array[i] );
        }
    }

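I'm using a Benchmark object in a try-with-resources block. When the block finishes, close() is called on the object and it prints the time the block took to complete. The two tests are run separately by commenting the max calls in the code above in and out.

'max' is added to a list outside the benchmark block and printed later, to keep the JVM from optimizing the whole block away.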
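The Benchmark helper and the result list are described but not shown in the post. A minimal sketch of what they might look like follows, assuming Benchmark simply records System.nanoTime() on construction and prints the elapsed milliseconds in close(). The names BenchmarkDemo, timedMax and elapsedMillis are made up for illustration and are not the author's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class BenchmarkDemo {
    // Hypothetical stand-in for the Benchmark helper; the real class is not shown in the post.
    static final class Benchmark implements AutoCloseable {
        private final String label;
        private final long start = System.nanoTime();
        private double elapsedMillis = -1.0; // stays negative until close() runs

        Benchmark(String label) {
            this.label = label;
        }

        @Override
        public void close() {
            elapsedMillis = (System.nanoTime() - start) / 1_000_000.0;
            System.out.println(label + " " + elapsedMillis);
        }

        double elapsedMillis() {
            return elapsedMillis;
        }
    }

    // Runs one timed pass over the array and returns the maximum to the caller.
    static int timedMax(int[] array) {
        int max = Integer.MIN_VALUE;
        try (Benchmark bench = new Benchmark("millis to max")) {
            for (int i = 0; i < array.length; ++i) {
                max = Math.max(max, array[i]);
            }
        }
        return max;
    }

    public static void main(String[] args) {
        int[] array = new int[(8 * 1024 * 1024) / 4];
        Random r = new Random();
        for (int i = 0; i < array.length; ++i) {
            array[i] = r.nextInt();
        }

        // The result is stored outside the timed block and printed afterwards,
        // so the loop's work stays observable.
        List<Integer> results = new ArrayList<>();
        results.add(timedMax(array));
        System.out.println("max = " + results.get(0));
    }
}
```

Because close() declares no checked exception, the try-with-resources block needs no catch clause.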

The array is randomized each time the test runs.

Running the test 6 times gives these results:

Java standard Math:

millis to max 9.242167 
millis to max 2.1566199999999998
millis to max 2.046396 
millis to max 2.048616  
millis to max 2.035761
millis to max 2.001044

Fairly stable after the first run, and running the tests again gives similar results.

OpsMath:

millis to max 8.65418 
millis to max 1.161559  
millis to max 0.955851 
millis to max 0.946642 
millis to max 0.994543 
millis to max 0.9469069999999999

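Again, very stable results after the first run.

The question is: why? That's quite a big difference, and I have no idea where it comes from. Even if I implement my max() method exactly like Math.max() (i.e. return (a >= b) ? a : b) I still get better results! It makes no sense.

Specs:

CPU: Intel i5 2500, 3.3 GHz. Java version: JDK 8 (March 18 public release), x64. Debian Jessie (testing release) x64.

I have yet to try a 32-bit JVM.

EDIT: Self-contained test as requested. Added a line to force the JVM to preload the Math and OpsMath classes. That eliminates the 18 ms cost of the first iteration in the OpsMath test.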

// Constant nano to millis.
final double TO_MILLIS = 1.0d / 1000000.0d;
// 8Mb alloc.
final int[] array = new int[(8*1024*1024)/4];
// Result and time array.
final ArrayList<Integer> results = new ArrayList<>();
final ArrayList<Double> times = new ArrayList<>();
// Number of tests.
final int itcount = 6;
// Call both Math and OpsMath methods so the JVM initializes the classes.
// int arguments are used because OpsMath.max as shown above only accepts ints.
System.out.println("initialize classes " + 
OpsMath.max( Math.max( 20, array.length ), array.length / 2 ));
    
final Random r = new Random();
for ( int it = 0; it < itcount; ++it )
{
    int max = Integer.MIN_VALUE;
    
    // Randomize the array.
    for ( int i = 0; i < array.length; ++i )
    {
        array[i] = r.nextInt();
    }
    
    final long start = System.nanoTime();
    for ( int i = 0; i < array.length; ++i )
    {
        max = Math.max( array[i], max );
        // OpsMath.max() method implemented as described.
        // max = OpsMath.max( array[i], max );
    }
    // Calc time.
    final double end = (System.nanoTime() - start);
    // Store results.
    times.add( Double.valueOf( end ) );
    results.add( Integer.valueOf(  max ) );
}
// Print everything.
for ( int i = 0; i < itcount; ++i )
{
    System.out.println( "IT" + i + " result: " + results.get( i ) );
    System.out.println( "IT" + i + " millis: " + times.get( i ) * TO_MILLIS );
}

Java Math.max results:

IT0 result: 2147477409
IT0 millis: 9.636998
IT1 result: 2147483098
IT1 millis: 1.901314
IT2 result: 2147482877
IT2 millis: 2.095551
IT3 result: 2147483286
IT3 millis: 1.9232859999999998
IT4 result: 2147482828
IT4 millis: 1.9455179999999999
IT5 result: 2147482475
IT5 millis: 1.882047

OpsMath.max results:

IT0 result: 2147482689
IT0 millis: 9.003616
IT1 result: 2147483480
IT1 millis: 0.882421
IT2 result: 2147483186
IT2 millis: 1.079143
IT3 result: 2147478560
IT3 millis: 0.8861169999999999
IT4 result: 2147477851
IT4 millis: 0.916383
IT5 result: 2147481983
IT5 millis: 0.873984

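The overall results are still the same. I've also tried randomizing the array only once and repeating the test over the same array. That is faster overall, but the same 2x difference between Java Math.max and OpsMath.max remains.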

Answer 1

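It's hard to tell why Math.max is slower than OpsMath.max, but it's easy to tell why this benchmark strongly favors branching over conditional moves: on the n-th iteration, the probability of

Math.max( array[i], max );

not being equal to max is the probability that array[n-1] is bigger than all previous elements (about 1/n for independent random values). Obviously, this probability gets lower and lower with growing n, and given

final int[] array = new int[(8*1024*1024)/4];

it's negligible most of the time. A conditional move instruction is insensitive to the branch probability; it always takes the same amount of time to execute. A conditional move is faster than a branch if the branch is very hard to predict. On the other hand, a branch is faster if it can be predicted well with high probability. Currently, I'm unsure about the speed of a conditional move compared to the best and worst cases of branching. ¹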

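In your case, all but the first few branches are fairly predictable. From about n == 10 onward, there's no point in using conditional moves, as the branch is practically guaranteed to be predicted correctly and can execute in parallel with other instructions (I guess you need exactly one cycle per iteration).

This seems to happen for algorithms computing a minimum/maximum or doing some inefficient sorting (good branch predictability means low entropy per step).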
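To make the predictability argument concrete, the sketch below counts how often the running maximum actually changes during one pass over a random array. For independent random values, the n-th element is a new maximum with probability about 1/n, so the expected number of updates is roughly the harmonic number H_n, which is about 15 for the roughly 2 million elements used here. The class name, method names and the seed are arbitrary choices for illustration.

```java
import java.util.Random;

final class MaxUpdateCount {
    // Counts how often the running maximum changes during one pass.
    static int countUpdates(int[] array) {
        int max = Integer.MIN_VALUE;
        int updates = 0;
        for (int value : array) {
            if (value > max) {
                max = value;
                updates++;
            }
        }
        return updates;
    }

    // Harmonic number H_n = 1 + 1/2 + ... + 1/n: the expected number of
    // updates for n distinct values in random order.
    static double harmonic(int n) {
        double h = 0.0;
        for (int k = 1; k <= n; k++) {
            h += 1.0 / k;
        }
        return h;
    }

    public static void main(String[] args) {
        int n = (8 * 1024 * 1024) / 4;
        int[] array = new int[n];
        Random r = new Random(42L); // arbitrary seed, used only for repeatability
        for (int i = 0; i < n; i++) {
            array[i] = r.nextInt();
        }
        System.out.println("updates observed: " + countUpdates(array));
        System.out.println("expected (H_n):   " + harmonic(n));
    }
}
```

In every other iteration the comparison goes the same way, which is the kind of branch a predictor handles well.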

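¹ Both a conditional move and a predicted branch take one cycle. The problem with the former is that it needs both of its operands, and this takes an additional instruction. In the end, the critical path may get longer and/or the ALUs may saturate while the branching unit sits idle. Often, but not always, branches can be predicted well in practical applications; that's why branch prediction was invented in the first place.

As for the gory details of timing conditional moves against the best and worst cases of branch prediction, see the discussion below in the comments. My own benchmark shows that a conditional move is significantly faster than a branch when branch prediction hits its worst case, but I can't ignore contradictory results. We need some explanation for what exactly makes the difference. More benchmarks and/or analysis could help.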

Answer 2

When I run your (suitably modified) code using Math.max on an old (1.6.0_27) JVM, the hot loop looks like this:

0x00007f4b65425c50: mov    %r11d,%edi         ;*getstatic array
                                              ; - foo146::bench@81 (line 40)
0x00007f4b65425c53: mov    0x10(%rax,%rdx,4),%r8d
0x00007f4b65425c58: mov    0x14(%rax,%rdx,4),%r10d
0x00007f4b65425c5d: mov    0x18(%rax,%rdx,4),%ecx
0x00007f4b65425c61: mov    0x2c(%rax,%rdx,4),%r11d
0x00007f4b65425c66: mov    0x28(%rax,%rdx,4),%r9d
0x00007f4b65425c6b: mov    0x24(%rax,%rdx,4),%ebx
0x00007f4b65425c6f: rex mov    0x20(%rax,%rdx,4),%esi
0x00007f4b65425c74: mov    0x1c(%rax,%rdx,4),%r14d  ;*iaload
                                              ; - foo146::bench@86 (line 40)
0x00007f4b65425c79: cmp    %edi,%r8d
0x00007f4b65425c7c: cmovl  %edi,%r8d
0x00007f4b65425c80: cmp    %r8d,%r10d
0x00007f4b65425c83: cmovl  %r8d,%r10d
0x00007f4b65425c87: cmp    %r10d,%ecx
0x00007f4b65425c8a: cmovl  %r10d,%ecx
0x00007f4b65425c8e: cmp    %ecx,%r14d
0x00007f4b65425c91: cmovl  %ecx,%r14d
0x00007f4b65425c95: cmp    %r14d,%esi
0x00007f4b65425c98: cmovl  %r14d,%esi
0x00007f4b65425c9c: cmp    %esi,%ebx
0x00007f4b65425c9e: cmovl  %esi,%ebx
0x00007f4b65425ca1: cmp    %ebx,%r9d
0x00007f4b65425ca4: cmovl  %ebx,%r9d
0x00007f4b65425ca8: cmp    %r9d,%r11d
0x00007f4b65425cab: cmovl  %r9d,%r11d         ;*invokestatic max
                                              ; - foo146::bench@88 (line 40)
0x00007f4b65425caf: add    $0x8,%edx          ;*iinc
                                              ; - foo146::bench@92 (line 39)
0x00007f4b65425cb2: cmp    $0x1ffff9,%edx
0x00007f4b65425cb8: jl     0x00007f4b65425c50

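Apart from the weirdly placed REX prefix (not sure what that's about), here you have a loop that has been unrolled 8 times and does mostly what you'd expect: loads, comparisons, and conditional moves. Interestingly, if you swap the order of the arguments to max, it outputs the other kind of 8-deep cmovl chain here. I guess it doesn't know how to generate a 3-deep tree of cmovls, or 8 separate cmovl chains to be merged after the loop is done.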
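The idea of several separate chains merged after the loop can also be written by hand in Java. The sketch below is only an illustration of that idea: the class and method names are made up, four accumulators are used instead of eight to keep it short, and whether HotSpot turns this into independent cmov chains, or whether it is faster at all, is not something the sketch establishes.

```java
final class SplitMax {
    // Single dependency chain: every step needs the previous result.
    static int maxSingleChain(int[] a) {
        int m = Integer.MIN_VALUE;
        for (int i = 0; i < a.length; i++) {
            m = Math.max(m, a[i]);
        }
        return m;
    }

    // Four independent chains, merged once after the loop.
    static int maxFourChains(int[] a) {
        int m0 = Integer.MIN_VALUE;
        int m1 = Integer.MIN_VALUE;
        int m2 = Integer.MIN_VALUE;
        int m3 = Integer.MIN_VALUE;
        int i = 0;
        int limit = a.length - 3;
        for (; i < limit; i += 4) {
            m0 = Math.max(m0, a[i]);
            m1 = Math.max(m1, a[i + 1]);
            m2 = Math.max(m2, a[i + 2]);
            m3 = Math.max(m3, a[i + 3]);
        }
        // Handle the up to three leftover elements.
        for (; i < a.length; i++) {
            m0 = Math.max(m0, a[i]);
        }
        return Math.max(Math.max(m0, m1), Math.max(m2, m3));
    }

    public static void main(String[] args) {
        int[] a = {5, -1, 9, 3, 7, 12, -4};
        System.out.println(maxSingleChain(a));
        System.out.println(maxFourChains(a));
    }
}
```

Both methods return the same value; only the shape of the data dependencies differs.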

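With the explicit OpsMath.max, it turns into a rat's nest of conditional and unconditional branches, unrolled 8 times. I'm not going to post the loop; it's not pretty. Basically, each mov/cmp/cmovl above gets broken into a load, a compare and a conditional jump to where a mov and a jmp happen. Interestingly, if you swap the order of the arguments to max, it outputs an 8-deep cmovle chain here instead. EDIT: As @maaartinus points out, that rat's nest of branches is actually faster on some machines, because the branch predictor works its magic on them and these are well-predicted branches.

I would hesitate to draw conclusions from this benchmark. You have benchmark construction issues: you have to run it many more times than you are, and you have to factor your code differently if you want to time HotSpot's fastest code. Beyond the wrapper code, you aren't measuring how fast your max is, or how well HotSpot understands what you're trying to do, or anything else of value here. Both implementations of max will result in code that's far too fast for any sort of direct measurement to be meaningful within the context of a larger program.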
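One possible reading of "run it more times and factor the code differently" is sketched below: the measured loop lives in its own method, it is called a number of times before any timing starts, and the median of several timed runs is reported. This is an assumption about what the answer has in mind, not its author's harness. The class name, the warm-up and run counts, and the static sink field are arbitrary choices for illustration.

```java
import java.util.Arrays;
import java.util.Random;

final class FactoredBench {
    // Results are folded into this field so the loop's work stays observable.
    static int sink;

    // The measured kernel is a separate method that is called repeatedly.
    static int maxLoop(int[] array) {
        int max = Integer.MIN_VALUE;
        for (int i = 0; i < array.length; ++i) {
            max = Math.max(max, array[i]);
        }
        return max;
    }

    // Warms up first, then times several runs and returns the median in milliseconds.
    static double medianMillis(int[] array, int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            sink ^= maxLoop(array);
        }
        double[] millis = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sink ^= maxLoop(array);
            millis[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        Arrays.sort(millis);
        return millis[runs / 2];
    }

    public static void main(String[] args) {
        int[] array = new int[(8 * 1024 * 1024) / 4];
        Random r = new Random();
        for (int i = 0; i < array.length; ++i) {
            array[i] = r.nextInt();
        }
        System.out.println("median millis: " + medianMillis(array, 50, 21));
        System.out.println("sink: " + sink);
    }
}
```

Even with these changes, the answer's caveat still applies: the number says little about how max behaves inside a larger program.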

Answer 3

Using JDK 8:

java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b132)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b70, mixed mode)

on Ubuntu 13.10,

I ran the following:

import java.util.Random;
import java.util.function.BiFunction;

public class MaxPerformance {
  private final BiFunction<Integer, Integer, Integer> max;
  private final int[] array;

  public MaxPerformance(BiFunction<Integer, Integer, Integer> max, int[] array) {
    this.max = max;
    this.array = array;
  }

  public double time() {
    long start = System.nanoTime();

    int m = Integer.MIN_VALUE;
    for (int i = 0; i < array.length; ++i) m = max.apply(m, array[i]);

    m = Integer.MIN_VALUE;
    for (int i = 0; i < array.length; ++i) m = max.apply(array[i], m);

    // total time over number of calls to max
    return ((double) (System.nanoTime() - start)) / (double) array.length / 2.0;
  }

  public double averageTime(int repeats) {
    double cumulativeTime = 0;
    for (int i = 0; i < repeats; i++)
      cumulativeTime += time();
    return (double) cumulativeTime / (double) repeats;
  }

  public static void main(String[] args) {
    int size = 1000000;
    Random random = new Random(123123123L);
    int[] array = new int[size];
    for (int i = 0; i < size; i++) array[i] = random.nextInt();

    double tMath = new MaxPerformance(Math::max, array).averageTime(100);
    double tAlt1 = new MaxPerformance(MaxPerformance::max1, array).averageTime(100);
    double tAlt2 = new MaxPerformance(MaxPerformance::max2, array).averageTime(100);

    System.out.println("Java Math: " + tMath);
    System.out.println("Alt 1:     " + tAlt1);
    System.out.println("Alt 2:     " + tAlt2);
  }

  public static int max1(final int a, final int b) {
    if (a >= b) return a;
    return b;
  }

  public static int max2(final int a, final int b) {
    return (a >= b) ? a : b; // same as JDK implementation
  }
}

And I got the following results (average nanoseconds taken for each call to max):

Java Math: 15.443555810000003
Alt 1:     14.968298919999997
Alt 2:     16.442204045

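So over a long run it looks like the second implementation listed (Alt 1, the if-based max1) is the fastest, although by a relatively small margin.

For a slightly more scientific test, it makes sense to compute the max of pairs of elements where each call is independent of the previous one. This can be done by using two randomized arrays instead of one, as in this benchmark: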

import java.util.Random;
import java.util.function.BiFunction;
public class MaxPerformance2 {
  private final BiFunction<Integer, Integer, Integer> max;
  private final int[] array1, array2;

  public MaxPerformance2(BiFunction<Integer, Integer, Integer> max, int[] array1, int[] array2) {
    this.max = max;
    this.array1 = array1;
    this.array2 = array2;
    if (array1.length != array2.length) throw new IllegalArgumentException();
  }

  public double time() {
    long start = System.nanoTime();

    int m = Integer.MIN_VALUE;
    for (int i = 0; i < array1.length; ++i) m = max.apply(array1[i], array2[i]);
    m += m; // to avoid optimizations!

    return ((double) (System.nanoTime() - start)) / (double) array1.length;
  }

  public double averageTime(int repeats) {
    // warm up rounds:
    double tmp = 0;
    for (int i = 0; i < 10; i++) tmp += time();
    tmp *= 2.0;

    double cumulativeTime = 0;
    for (int i = 0; i < repeats; i++)
        cumulativeTime += time();
    return cumulativeTime / (double) repeats;
  }

  public static void main(String[] args) {
    int size = 1000000;
    Random random = new Random(123123123L);
    int[] array1 = new int[size];
    int[] array2 = new int[size];
    for (int i = 0; i < size; i++) {
        array1[i] = random.nextInt();
        array2[i] = random.nextInt();
    }

    double tMath = new MaxPerformance2(Math::max, array1, array2).averageTime(100);
    double tAlt1 = new MaxPerformance2(MaxPerformance2::max1, array1, array2).averageTime(100);
    double tAlt2 = new MaxPerformance2(MaxPerformance2::max2, array1, array2).averageTime(100);

    System.out.println("Java Math: " + tMath);
    System.out.println("Alt 1:     " + tAlt1);
    System.out.println("Alt 2:     " + tAlt2);
  }

  public static int max1(final int a, final int b) {
    if (a >= b) return a;
    return b;
  }

  public static int max2(final int a, final int b) {
    return (a >= b) ? a : b; // same as JDK implementation
  }
}

Which gave me:

Java Math: 15.346468170000005
Alt 1:     16.378737519999998
Alt 2:     20.506475350000006

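The way your test is set up makes a huge difference to the results. The JDK version seems to be the fastest in this scenario, this time by a relatively large margin compared to the previous case.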
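One setup detail that may be worth checking is the functional interface itself. BiFunction<Integer, Integer, Integer> takes and returns Integer objects, so each call may box its int arguments and unbox the result unless the JIT removes that work. The sketch below is a variant that uses java.util.function.IntBinaryOperator, which works on primitives, and sums the results so that every call's value is used. The class and method names and the repeat counts are hypothetical, and no timings are claimed for it.

```java
import java.util.Random;
import java.util.function.IntBinaryOperator;

final class PrimitiveMaxBench {
    static int max1(int a, int b) {
        if (a >= b) return a;
        return b;
    }

    static int max2(int a, int b) {
        return (a >= b) ? a : b;
    }

    // Sums the pairwise maxima so that every call's result is consumed.
    static long sumOfMax(IntBinaryOperator max, int[] a1, int[] a2) {
        long sum = 0;
        for (int i = 0; i < a1.length; ++i) {
            sum += max.applyAsInt(a1[i], a2[i]);
        }
        return sum;
    }

    // Average nanoseconds per call after a few warm-up rounds.
    static double nanosPerCall(IntBinaryOperator max, int[] a1, int[] a2, int repeats) {
        long sink = 0;
        for (int i = 0; i < 10; i++) {
            sink += sumOfMax(max, a1, a2);
        }
        long start = System.nanoTime();
        for (int i = 0; i < repeats; i++) {
            sink += sumOfMax(max, a1, a2);
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.print(""); // keeps sink observable
        return (double) elapsed / ((double) repeats * a1.length);
    }

    public static void main(String[] args) {
        int size = 1000000;
        Random random = new Random(123123123L);
        int[] array1 = new int[size];
        int[] array2 = new int[size];
        for (int i = 0; i < size; i++) {
            array1[i] = random.nextInt();
            array2[i] = random.nextInt();
        }
        System.out.println("Java Math: " + nanosPerCall(Math::max, array1, array2, 100));
        System.out.println("Alt 1:     " + nanosPerCall(PrimitiveMaxBench::max1, array1, array2, 100));
        System.out.println("Alt 2:     " + nanosPerCall(PrimitiveMaxBench::max2, array1, array2, 100));
    }
}
```

If the numbers from a variant like this differ noticeably from the BiFunction version, that would suggest the original figures include overhead unrelated to max itself.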

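Somebody mentioned Caliper. Well, if you read the wiki, the first thing they say about micro-benchmarking is not to do it, because it's hard to get accurate results in general. I think this is a clear example of that.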

Answer 1: 12, accepted, 2014-03-31 06:42:04
Answer 2: 3, 2014-03-31 04:30:50
Answer 3: 1, 2014-03-31 14:19:13
