Why does the JVM show more latency for the same block of code after a busy-spin pause?

Question

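The code below demonstrates the problem clearly, namely:

After a busy-spin pause, exactly the same block of code becomes slower.

Please note that I am of course not using Thread.sleep. Also note that there are no conditionals that could lead to a HotSpot/JIT de-optimization, because I change the pause with a math operation instead of an IF.

I have a block of math operations that I want to time.
First I time the block with a 1-nanosecond pause before each measurement. I do that 20,000 times.
Then I change the pause from 1 nanosecond to 5 seconds and keep measuring the latency as usual. I do that 15 times.
Then I print the last 30 measurements, so you can see 15 measurements with the 1-nanosecond pause and 15 measurements with the 5-second pause.

As you can see below, the discrepancy is big, especially in the very first measurement after the pause change. Why is that!?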

$ java -server -cp . JvmPauseLatency
Sat Apr 29 10:34:28 EDT 2017 => Please wait 75 seconds for the results...
Sat Apr 29 10:35:43 EDT 2017 => Calculation: 4.0042328611017236E11
Results:
215
214
215
214
215
214
217
215
216
214
216
213
215
214
215
2343 <----- FIRST MEASUREMENT AFTER PAUSE CHANGE
795
727
942
778
765
856
762
801
708
692
765
776
780
754

Code:

import java.util.Arrays;
import java.util.Date;
import java.util.Random;

public class JvmPauseLatency {

    private static final int WARMUP = 20000;
    private static final int EXTRA = 15;
    private static final long PAUSE = 5 * 1000000000L; // in nanos

    private final Random rand = new Random();
    private int count;
    private double calculation;
    private final long[] results = new long[WARMUP + EXTRA];
    private long interval = 1; // in nanos

    private long busyPause(long pauseInNanos) {
        final long start = System.nanoTime();
        long until = Long.MAX_VALUE;
        while(System.nanoTime() < until) {
           until = start + pauseInNanos;
        }
        return until;
    }

    public void run() {

        long testDuration = ((WARMUP * 1) + (EXTRA * PAUSE)) / 1000000000L;
        System.out.println(new Date() +" => Please wait " + testDuration + " seconds for the results...");

        while(count < results.length) {

            double x = busyPause(interval);

            long latency = System.nanoTime();

            calculation += x / (rand.nextInt(5) + 1);
            calculation -= calculation / (rand.nextInt(5) + 1);
            calculation -= x / (rand.nextInt(6) + 1);
            calculation += calculation / (rand.nextInt(6) + 1);

            latency = System.nanoTime() - latency;

            results[count++] = latency;
            interval = (count / WARMUP * (PAUSE - 1)) + 1; // it will change to PAUSE when it reaches WARMUP
        }

        // now print the last (EXTRA * 2) results so you can compare before and after the pause change (from 1 to PAUSE)
        System.out.println(new Date() + " => Calculation: " + calculation);
        System.out.println("Results:");
        long[] array = Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length);
        for(long t: array) System.out.println(t);
    }

    public static void main(String[] args) {
        new JvmPauseLatency().run();
    }
}
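
The branch-free pause switch in the loop above relies on integer division: count / WARMUP is 0 while count is below WARMUP and 1 afterwards (count never reaches 2 * WARMUP in this test). A minimal sketch of just that expression, with the hypothetical class name IntervalSwitch and helper intervalFor introduced only for illustration:

```java
final class IntervalSwitch {
    static final int WARMUP = 20000;
    static final long PAUSE = 5 * 1000000000L; // 5 seconds in nanos

    // Same expression as in the question: the integer quotient is 0 while
    // count < WARMUP and 1 once count is in [WARMUP, 2 * WARMUP).
    static long intervalFor(int count) {
        return (count / WARMUP * (PAUSE - 1)) + 1;
    }

    public static void main(String[] args) {
        System.out.println(intervalFor(0));          // 1
        System.out.println(intervalFor(WARMUP - 1)); // 1
        System.out.println(intervalFor(WARMUP));     // 5000000000
    }
}
```

The expression itself contains no conditional, but the while loop inside busyPause still does, and that loop sees a very different iteration count once the interval jumps.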

Answer 1

TL;DR

http://www.brendangregg.com/activebenchmarking.html

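Casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.

Problem N1. The very first measurement after the pause change.

It looks like you are facing on-stack replacement (OSR). When OSR occurs, the VM is paused and the stack frame of the target function is replaced by an equivalent frame.

The root cause is an incorrect microbenchmark: it was not properly warmed up. Just insert the following line into your benchmark before the while loop to fix it: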

System.out.println("WARMUP = " + busyPause(5000000000L));

How to check this: run your benchmark with -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls. I modified your code so that it prints the interval to system output before every call:

interval = 1
interval = 1
interval = 5000000000
    689  145       4       JvmPauseLatency::busyPause (19 bytes)   made not entrant
    689  146       3       JvmPauseLatency::busyPause (19 bytes)
Installing method (3) JvmPauseLatency.busyPause(J)J 
    698  147 %     4       JvmPauseLatency::busyPause @ 6 (19 bytes)
Installing osr method (4) JvmPauseLatency.busyPause(J)J @ 6
    702  148       4       JvmPauseLatency::busyPause (19 bytes)
    705  146       3       JvmPauseLatency::busyPause (19 bytes)   made not entrant
Installing method (4) JvmPauseLatency.busyPause(J)J 
interval = 5000000000
interval = 5000000000
interval = 5000000000
interval = 5000000000
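
In the log above, the line marked with % and "@ 6" is the OSR compilation, entered at bytecode index 6 of busyPause. To see the same kind of line in isolation, here is a minimal sketch; the class OsrDemo and the method sumTo are hypothetical names used only for this illustration:

```java
final class OsrDemo {
    // A single long-running loop: the method is entered only once, so the JIT
    // can switch to compiled code only in the middle of the loop, via OSR.
    static long sumTo(long n) {
        long sum = 0;
        for (long i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Run with: java -XX:+PrintCompilation OsrDemo
        System.out.println(sumTo(200_000_000L));
    }
}
```

With -XX:+PrintCompilation one would expect at least one OsrDemo::sumTo line carrying the % marker, although the exact output depends on the JVM version and settings.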

Usually OSR occurs at tier 4, so in order to disable it you can use the following options:

-XX:-TieredCompilation disables tiered compilation
-XX:+TieredCompilation -XX:TieredStopAtLevel=3 stops tiered compilation before level 4
-XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement disables OSR
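
When experimenting with these options it is easy to lose track of which ones a given run actually received. A small sketch that prints the JVM's input arguments through the standard RuntimeMXBean; the class JvmFlagCheck and the helper isSwitchedOff are hypothetical, and the helper only detects flags passed explicitly as -XX:-Name, not JVM defaults:

```java
import java.lang.management.ManagementFactory;
import java.util.List;

final class JvmFlagCheck {
    // True if the given HotSpot boolean flag was explicitly switched off
    // on the command line (it does not look at JVM defaults).
    static boolean isSwitchedOff(List<String> jvmArgs, String flagName) {
        return jvmArgs.contains("-XX:-" + flagName);
    }

    public static void main(String[] args) {
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments: " + jvmArgs);
        System.out.println("OSR switched off: "
                + isSwitchedOff(jvmArgs, "UseOnStackReplacement"));
        System.out.println("Tiered compilation switched off: "
                + isSwitchedOff(jvmArgs, "TieredCompilation"));
    }
}
```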

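Problem N2. How to measure.

Let's start with the article https://shipilev.net/blog/2014/nanotrusting-nanotime. In a few words:

The JIT can compile only a method; your test consists of a single loop, so only OSR is available for your test
You are trying to measure something small, maybe smaller than the nanoTime() call itself (see "What is the cost of a volatile write?")
At the microarchitecture level, caches and CPU pipeline stalls matter; for example, a TLB miss or a branch misprediction can take more time than the code under test

So, to avoid all these pitfalls you can use a JMH-based benchmark like this: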
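
To get a feel for the second point, the cost of the timer call itself can be estimated with a rough sketch like the one below. The class NanoTimeCost and the method averageCostNanos are hypothetical names, and this is itself a casual benchmark, so treat the number as an order of magnitude only:

```java
final class NanoTimeCost {
    // Average cost of one System.nanoTime() call, in nanoseconds,
    // measured over `calls` back-to-back invocations.
    static double averageCostNanos(int calls) {
        long sink = 0;
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            sink += System.nanoTime(); // accumulate so the calls stay observable
        }
        long elapsed = System.nanoTime() - start;
        if (sink == 42) System.out.println(); // keep `sink` alive
        return (double) elapsed / calls;
    }

    public static void main(String[] args) {
        averageCostNanos(1_000_000); // warm-up pass
        System.out.printf("nanoTime() costs about %.1f ns per call%n",
                averageCostNanos(10_000_000));
    }
}
```

If the reported cost is in the same range as the block being timed, the two nanoTime() calls around that block account for a large share of each reading.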

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2)
public class LatencyTest {

    public static final long LONG_PAUSE = 5000L;
    public static final long SHORT_PAUSE = 1L;
    public Random rand;

    @Setup
    public void initI() {
        rand = new Random(0xDEAD_BEEF);
    }

    private long busyPause(long pauseInNanos) {
        Blackhole.consumeCPU(pauseInNanos);
        return pauseInNanos;
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseShort() {
        return busyPause(SHORT_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseLong() {
        return busyPause(LONG_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyFunc() {
        return doCalculation(1);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureShort() {
        long x = busyPause(SHORT_PAUSE);
        return doCalculation(x);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureLong() {
        long x = busyPause(LONG_PAUSE);
        return doCalculation(x);
    }

    private long doCalculation(long x) {
        long calculation = 0;
        calculation += x / (rand.nextInt(5) + 1);
        calculation -= calculation / (rand.nextInt(5) + 1);
        calculation -= x / (rand.nextInt(6) + 1);
        calculation += calculation / (rand.nextInt(6) + 1);
        return calculation;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LatencyTest.class.getName())
                .verbosity(VerboseMode.NORMAL)
                .build();
        new Runner(options).run();
    }
}

Please note that I have changed the busy-loop implementation to Blackhole#consumeCPU() to avoid OS-related effects. My results are:

Benchmark                          Mode  Cnt      Score     Error  Units
LatencyTest.latencyBusyPauseLong   avgt    4  15992.216 ± 106.538  ns/op
LatencyTest.latencyBusyPauseShort  avgt    4      6.450 ±   0.163  ns/op
LatencyTest.latencyFunc            avgt    4     97.321 ±   0.984  ns/op
LatencyTest.measureLong            avgt    4  16103.228 ± 102.338  ns/op
LatencyTest.measureShort           avgt    4    100.454 ±   0.041  ns/op

Please note that the results are almost additive, i.e. latencyFunc + latencyBusyPauseShort ≈ measureShort
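
As a quick check of that additivity using the scores from the table: 97.321 + 6.450 = 103.771 against a measured 100.454, and 97.321 + 15992.216 = 16089.537 against a measured 16103.228. The sketch below computes the relative gaps; the class AdditivityCheck and the method relativeGap are hypothetical names:

```java
final class AdditivityCheck {
    // Relative difference between the sum of two parts and the combined score.
    static double relativeGap(double partA, double partB, double combined) {
        return Math.abs(partA + partB - combined) / combined;
    }

    public static void main(String[] args) {
        double func = 97.321, pauseShort = 6.450, pauseLong = 15992.216;
        System.out.printf("short gap: %.2f%%%n",
                100 * relativeGap(func, pauseShort, 100.454));
        System.out.printf("long gap:  %.2f%%%n",
                100 * relativeGap(func, pauseLong, 16103.228));
    }
}
```

The gaps come out at roughly 3.3% for the short pause and under 0.1% for the long one.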

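Problem N3. The discrepancy is big.

What is wrong with your test? It does not warm up the JVM properly, i.e. it uses one parameter to warm up and another one to test. Why is this important? The JVM uses profile-guided optimizations; for example, it counts how often a branch is taken and generates the "best" (branch-free) code for that particular profile. So when we warm up the benchmark with parameter 1, the JVM generates "optimal code" in which the branch in the while loop is never taken. Here is an event from the JIT compilation log (-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation):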

<branch prob="0.0408393" not_taken="40960" taken="1744" cnt="42704" target_bci="42"/>
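
The numbers in that log line are consistent with prob being taken divided by cnt, where cnt = taken + not_taken: 40960 + 1744 = 42704 and 1744 / 42704 ≈ 0.0408393. That reading of the attributes is an assumption drawn from the arithmetic, not from HotSpot documentation. A sketch of the calculation, with hypothetical names BranchProfile and takenProbability:

```java
final class BranchProfile {
    // Fraction of profiled executions in which the branch was taken.
    static double takenProbability(long taken, long notTaken) {
        return (double) taken / (taken + notTaken);
    }

    public static void main(String[] args) {
        // Counters copied from the <branch .../> line above.
        System.out.println(takenProbability(1744, 40960));
    }
}
```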

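After the pause parameter changes, the JIT uses an uncommon trap to handle your code, which is not optimal. I have created a benchmark based on the original one, with minor changes:

busyPause is replaced by consumeCPU from JMH in order to have a pure Java benchmark without interaction with the system (actually nanoTime uses the userland function vdso clock_gettime, and we cannot profile that code)
All calculations are removed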

import java.util.Arrays;

public class JvmPauseLatency {

    private static final int WARMUP = 2000 ;
    private static final int EXTRA = 10;
    private static final long PAUSE = 70000L; // in nanos
    private static volatile long consumedCPU = System.nanoTime();

    //org.openjdk.jmh.infra.Blackhole.consumeCPU()
    private static void consumeCPU(long tokens) {
        long t = consumedCPU;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        if (t == 42) {
            consumedCPU += t;
        }
    }

    public void run(long warmPause) {
        long[] results = new long[WARMUP + EXTRA];
        int count = 0;
        long interval = warmPause;
        while(count < results.length) {

            consumeCPU(interval);

            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;

            results[count++] = latency;
            if (count == WARMUP) {
                interval = PAUSE;
            }
        }

        System.out.println("Results:" + Arrays.toString(Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
    }

    public static void main(String[] args) {
        int totalCount = 0;
        while (totalCount < 100) {
            new JvmPauseLatency().run(0);
            totalCount ++;
        }
    }
}

The results are:

Results:[62, 66, 63, 64, 62, 62, 60, 58, 65, 61, 127, 245, 140, 85, 88, 114, 76, 199, 310, 196]
Results:[61, 63, 65, 64, 62, 65, 82, 63, 67, 70, 104, 176, 368, 297, 272, 183, 248, 217, 267, 181]
Results:[62, 65, 60, 59, 54, 64, 63, 71, 48, 59, 202, 74, 400, 247, 215, 184, 380, 258, 266, 323]

In order to fix this benchmark, just replace new JvmPauseLatency().run(0) with new JvmPauseLatency().run(PAUSE); here are the results:

Results:[46, 45, 44, 45, 48, 46, 43, 72, 50, 47, 46, 44, 54, 45, 43, 43, 43, 48, 46, 43]
Results:[44, 44, 45, 45, 43, 46, 46, 44, 44, 44, 43, 49, 45, 44, 43, 49, 45, 46, 45, 44]

If you want to change the "pause" dynamically, you have to warm up the JVM dynamically as well, i.e.

    while(count < results.length) {

        consumeCPU(interval);

        long latency = System.nanoTime();
        latency = System.nanoTime() - latency;

        results[count++] = latency;
        if (count >= WARMUP) {
            interval = PAUSE;
        } else {
            interval = rnd.nextBoolean() ? PAUSE : 0; // rnd: a java.util.Random instance (assumed; not declared in this snippet)
        }
    }
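
The fragment above is not runnable by itself because rnd is never declared. A self-contained sketch of the same idea follows, assuming rnd is a java.util.Random; the class name MixedWarmup and the parameterised run signature are hypothetical, and consumeCPU is copied from the listing earlier in this answer:

```java
import java.util.Arrays;
import java.util.Random;

final class MixedWarmup {
    private static volatile long consumedCPU = System.nanoTime();

    // Same busy-work loop as consumeCPU in the earlier listing.
    static void consumeCPU(long tokens) {
        long t = consumedCPU;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        if (t == 42) {
            consumedCPU += t;
        }
    }

    // Warm up with a random mix of 0 and `pause`, then run `extra` more
    // iterations at `pause`; returns only those last `extra` latencies.
    static long[] run(int warmup, int extra, long pause, Random rnd) {
        long[] results = new long[warmup + extra];
        long interval = 0;
        int count = 0;
        while (count < results.length) {
            consumeCPU(interval);

            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;

            results[count++] = latency;
            // After warm-up always use `pause`; before that, pick randomly.
            interval = (count >= warmup || rnd.nextBoolean()) ? pause : 0;
        }
        return Arrays.copyOfRange(results, warmup, results.length);
    }

    public static void main(String[] args) {
        System.out.println("Results:"
                + Arrays.toString(run(2000, 10, 70000L, new Random(42))));
    }
}
```

Because both pause values occur during warm-up, the profile collected for the loop in consumeCPU covers both cases before the measured iterations start.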

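Problem N4. What about the interpreter, -Xint?

In the case of a switch-based interpreter we have a lot of problems, the main one being indirect branch instructions. I ran 3 experiments:

Random warmup
Constant warmup with pause 0
The whole test uses pause 0

Each experiment was started with the following command: sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,branch-misses java -Xint JvmPauseLatency, and the results are: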

 Performance counter stats for 'java -Xint JvmPauseLatency':

   272,822,274,275      cycles                                                      
   723,420,125,590      instructions              #    2.65  insn per cycle         
        26,994,494      cache-references                                            
         8,575,746      cache-misses              #   31.769 % of all cache refs    
     2,060,138,555      bus-cycles                                                  
         2,930,155      branch-misses                                               

      86.808481183 seconds time elapsed

 Performance counter stats for 'java -Xint JvmPauseLatency':

     2,812,949,238      cycles                                                      
     7,267,497,946      instructions              #    2.58  insn per cycle         
         6,936,666      cache-references                                            
         1,107,318      cache-misses              #   15.963 % of all cache refs    
        21,410,797      bus-cycles                                                  
           791,441      branch-misses                                               

       0.907758181 seconds time elapsed

 Performance counter stats for 'java -Xint JvmPauseLatency':

       126,157,793      cycles                                                      
       158,845,300      instructions              #    1.26  insn per cycle         
         6,650,471      cache-references                                            
           909,593      cache-misses              #   13.677 % of all cache refs    
         1,635,548      bus-cycles                                                  
           775,564      branch-misses                                               

       0.073511817 seconds time elapsed

In the case of branch misses, latency and footprint grow non-linearly because of the huge memory footprint.
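
For readers unfamiliar with the term, a switch-based interpreter sends every instruction through one central dispatch point. The toy stack machine below is a sketch of that structure only; it is not HotSpot's interpreter, and the class ToySwitchInterpreter and its opcodes are invented for the illustration:

```java
final class ToySwitchInterpreter {
    static final int PUSH = 0, ADD = 1, MUL = 2, HALT = 3;

    // Executes a tiny stack-machine program. Every instruction is dispatched
    // through the same switch statement.
    static long execute(int[] code) {
        long[] stack = new long[16];
        int sp = 0; // index of the next free stack slot
        int pc = 0; // index of the next instruction
        while (true) {
            switch (code[pc++]) {
                case PUSH: stack[sp++] = code[pc++]; break;
                case ADD:  sp--; stack[sp - 1] += stack[sp]; break;
                case MUL:  sp--; stack[sp - 1] *= stack[sp]; break;
                case HALT: return stack[sp - 1];
                default:   throw new IllegalStateException("unknown opcode");
            }
        }
    }

    public static void main(String[] args) {
        // (2 + 3) * 4
        int[] program = {PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT};
        System.out.println(execute(program)); // prints 20
    }
}
```

A dispatch switch like this is typically implemented as a jump through a table, so the CPU has to predict the target of the same indirect branch for every instruction executed.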

Answer 2

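You probably cannot rely on the precision of any timer to give you the accuracy you seem to want. https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime-- states:

This method provides nanosecond precision, but not necessarily nanosecond resolution (that is, how frequently the value changes) - no guarantees are made except that the resolution is at least as good as that of currentTimeMillis().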
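
One way to see what that resolution looks like on a particular machine is to record the smallest non-zero step between consecutive readings. A rough sketch, with the hypothetical names NanoTimeGranularity and smallestStepNanos; the result varies by OS, hardware and JVM:

```java
final class NanoTimeGranularity {
    // Smallest non-zero difference observed between consecutive
    // System.nanoTime() readings; Long.MAX_VALUE if the value never changed.
    static long smallestStepNanos(int samples) {
        long smallest = Long.MAX_VALUE;
        long previous = System.nanoTime();
        for (int i = 0; i < samples; i++) {
            long now = System.nanoTime();
            long step = now - previous;
            if (step > 0 && step < smallest) {
                smallest = step;
            }
            previous = now;
        }
        return smallest;
    }

    public static void main(String[] args) {
        System.out.println("Smallest observed nanoTime() step: "
                + smallestStepNanos(10_000_000) + " ns");
    }
}
```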

Answer 1: score 10, posted 2017-04-29 18:30:15
Answer 2: score -1, posted 2017-05-23 17:59:06
