Why does the JVM show more latency for the same block of code after a busy spin pause?

The code below demonstrates the problem unequivocally:

The exact same block of code becomes slower after a busy spin pause.

Note that of course I'm not using Thread.sleep. Also note that there are no conditionals leading to a HotSpot/JIT de-optimization, since I change the pause using a math operation, not an if.

  • There is a block of math operations that I want to time.
  • First I time the block, pausing 1 nanosecond before I start each measurement. I do that 20,000 times.
  • Then I change the pause from 1 nanosecond to 5 seconds and proceed to measure the latency as usual. I do that 15 times.
  • Then I print the last 30 measurements, so you can see 15 measurements with the 1-nanosecond pause and 15 measurements with the 5-second pause.

As you can see below, the discrepancy is big, especially in the very first measurement after the pause change. Why is that!?

$ java -server -cp . JvmPauseLatency
Sat Apr 29 10:34:28 EDT 2017 => Please wait 75 seconds for the results...
Sat Apr 29 10:35:43 EDT 2017 => Calculation: 4.0042328611017236E11
Results:
215
214
215
214
215
214
217
215
216
214
216
213
215
214
215
2343 <----- FIRST MEASUREMENT AFTER PAUSE CHANGE
795
727
942
778
765
856
762
801
708
692
765
776
780
754

The code:

import java.util.Arrays;
import java.util.Date;
import java.util.Random;

public class JvmPauseLatency {

    private static final int WARMUP = 20000;
    private static final int EXTRA = 15;
    private static final long PAUSE = 5 * 1000000000L; // in nanos

    private final Random rand = new Random();
    private int count;
    private double calculation;
    private final long[] results = new long[WARMUP + EXTRA];
    private long interval = 1; // in nanos

    private long busyPause(long pauseInNanos) {
        final long start = System.nanoTime();
        long until = Long.MAX_VALUE;
        while(System.nanoTime() < until) {
           until = start + pauseInNanos;
        }
        return until;
    }

    public void run() {

        long testDuration = ((WARMUP * 1) + (EXTRA * PAUSE)) / 1000000000L;
        System.out.println(new Date() +" => Please wait " + testDuration + " seconds for the results...");

        while(count < results.length) {

            double x = busyPause(interval);

            long latency = System.nanoTime();

            calculation += x / (rand.nextInt(5) + 1);
            calculation -= calculation / (rand.nextInt(5) + 1);
            calculation -= x / (rand.nextInt(6) + 1);
            calculation += calculation / (rand.nextInt(6) + 1);

            latency = System.nanoTime() - latency;

            results[count++] = latency;
            interval = (count / WARMUP * (PAUSE - 1)) + 1; // it will change to PAUSE when it reaches WARMUP
        }

        // now print the last (EXTRA * 2) results so you can compare before and after the pause change (from 1 to PAUSE)
        System.out.println(new Date() + " => Calculation: " + calculation);
        System.out.println("Results:");
        long[] array = Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length);
        for(long t: array) System.out.println(t);
    }

    public static void main(String[] args) {
        new JvmPauseLatency().run();
    }
}

TL;DR

http://www.brendangregg.com/activebenchmarking.html

casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.

Problem N1. The very first measurement after the pause change.

It looks like you are faced with on-stack replacement. When OSR occurs, the VM is paused, and the stack frame for the target function is replaced by an equivalent frame.

The root cause is a flawed microbenchmark: it was not properly warmed up. To fix it, just insert the following line into your benchmark before the while loop:

System.out.println("WARMUP = " + busyPause(5000000000L));

How to check this: just run your benchmark with -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls. I've modified your code so that it now prints interval to standard output before every call:

interval = 1
interval = 1
interval = 5000000000
    689  145       4       JvmPauseLatency::busyPause (19 bytes)   made not entrant
    689  146       3       JvmPauseLatency::busyPause (19 bytes)
Installing method (3) JvmPauseLatency.busyPause(J)J 
    698  147 %     4       JvmPauseLatency::busyPause @ 6 (19 bytes)
Installing osr method (4) JvmPauseLatency.busyPause(J)J @ 6
    702  148       4       JvmPauseLatency::busyPause (19 bytes)
    705  146       3       JvmPauseLatency::busyPause (19 bytes)   made not entrant
Installing method (4) JvmPauseLatency.busyPause(J)J 
interval = 5000000000
interval = 5000000000
interval = 5000000000
interval = 5000000000

Usually OSR occurs at tier 4, so in order to disable it you can use the following options:

  • -XX:-TieredCompilation disables tiered compilation entirely
  • -XX:+TieredCompilation -XX:TieredStopAtLevel=3 caps compilation at tier 3, so tier-4 code (and tier-4 OSR) is never generated
  • -XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement disables OSR
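As a sketch (assuming JvmPauseLatency.class is on the classpath, as in the run above), the three variants would be launched like this:

```shell
# 1) Disable tiered compilation entirely
java -XX:-TieredCompilation JvmPauseLatency

# 2) Cap compilation at tier 3, so tier-4 code (and tier-4 OSR) never happens
java -XX:+TieredCompilation -XX:TieredStopAtLevel=3 JvmPauseLatency

# 3) Keep tier 4 but disable on-stack replacement
java -XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement JvmPauseLatency
```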

Problem N2. How to measure.

Let's start from the article https://shipilev.net/blog/2014/nanotrusting-nanotime. In a few words:

  • the JIT compiles whole methods; in your test you have a single big loop, so only OSR is available to your test
  • you are trying to measure something small, possibly smaller than the cost of the nanoTime() call itself (see "What is the cost of volatile write?")
  • the microarchitecture level matters: caches and CPU pipeline stalls are important; for example, a TLB miss or a branch misprediction can take more time than the test execution itself

So in order to avoid all these pitfalls you can use a JMH-based benchmark like this:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2)
public class LatencyTest {

    public static final long LONG_PAUSE = 5000L;
    public static final long SHORT_PAUSE = 1L;
    public Random rand;

    @Setup
    public void initI() {
        rand = new Random(0xDEAD_BEEF);
    }

    private long busyPause(long pauseInNanos) {
        Blackhole.consumeCPU(pauseInNanos);
        return pauseInNanos;
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseShort() {
        return busyPause(SHORT_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseLong() {
        return busyPause(LONG_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyFunc() {
        return doCalculation(1);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureShort() {
        long x = busyPause(SHORT_PAUSE);
        return doCalculation(x);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureLong() {
        long x = busyPause(LONG_PAUSE);
        return doCalculation(x);
    }

    private long doCalculation(long x) {
        long calculation = 0;
        calculation += x / (rand.nextInt(5) + 1);
        calculation -= calculation / (rand.nextInt(5) + 1);
        calculation -= x / (rand.nextInt(6) + 1);
        calculation += calculation / (rand.nextInt(6) + 1);
        return calculation;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LatencyTest.class.getName())
                .verbosity(VerboseMode.NORMAL)
                .build();
        new Runner(options).run();
    }
}

Please note that I've changed the busy-loop implementation to Blackhole#consumeCPU() in order to avoid OS-related effects. So my results are:

Benchmark                          Mode  Cnt      Score     Error  Units
LatencyTest.latencyBusyPauseLong   avgt    4  15992.216 ± 106.538  ns/op
LatencyTest.latencyBusyPauseShort  avgt    4      6.450 ±   0.163  ns/op
LatencyTest.latencyFunc            avgt    4     97.321 ±   0.984  ns/op
LatencyTest.measureLong            avgt    4  16103.228 ± 102.338  ns/op
LatencyTest.measureShort           avgt    4    100.454 ±   0.041  ns/op

Please note that the results are almost additive, i.e. latencyFunc + latencyBusyPauseShort ≈ measureShort (97.321 + 6.450 = 103.771 ns/op, versus the measured 100.454 ns/op).

Problem N3. The discrepancy is big.

What is wrong with your test? It does not warm up the JVM properly: it uses one parameter for warm-up and a different one for the test. Why is this important? The JVM uses profile-guided optimizations; for example, it counts how often a branch has been taken and generates "the best" (branch-free) code for that particular profile. So while we are warming up the JVM with parameter 1, it generates "optimal" code in which the branch in the while loop has never been taken. Here is an event from the JIT compilation log (-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation):

<branch prob="0.0408393" not_taken="40960" taken="1744" cnt="42704" target_bci="42"/> 
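The effect is easy to reproduce in isolation. Below is a minimal, hypothetical sketch (not the benchmark above): during warmup one side of a branch is never taken, so the JIT is free to compile it down to an uncommon trap; the first time the rare side is actually taken, the compiled code is deoptimized (visible as "made not entrant" when run with -XX:+PrintCompilation).

```java
// Sketch of profile pollution: the condition is always false during
// warmup, so the JIT may compile the taken side as an uncommon trap;
// the first true value then deoptimizes the compiled method.
public class ProfilePollution {

    private static long counter = 0;

    private static void work(boolean rare) {
        if (rare) {            // never taken during warmup
            counter += 2;
        } else {
            counter += 1;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000_000; i++) work(false); // train the profile
        for (int i = 0; i < 10; i++) work(true);         // hit the rare branch
        System.out.println(counter);                     // prints 1000020
    }
}
```

Run it with -XX:+PrintCompilation to watch the recompilation; the arithmetic result itself is of course deterministic.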

After the parameter change, the JIT falls back to an uncommon trap to handle your code, which is not optimal. I've created a benchmark based on the original one, with minor changes:

  • busyPause is replaced by consumeCPU from JMH in order to have a pure Java benchmark with no interaction with the system (nanoTime actually goes through the userland vDSO function clock_gettime, and we are unable to profile that code)
  • all calculations are removed


import java.util.Arrays;

public class JvmPauseLatency {

    private static final int WARMUP = 2000 ;
    private static final int EXTRA = 10;
    private static final long PAUSE = 70000L; // in nanos
    private static volatile long consumedCPU = System.nanoTime();

    //org.openjdk.jmh.infra.Blackhole.consumeCPU()
    private static void consumeCPU(long tokens) {
        long t = consumedCPU;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        if (t == 42) {
            consumedCPU += t;
        }
    }

    public void run(long warmPause) {
        long[] results = new long[WARMUP + EXTRA];
        int count = 0;
        long interval = warmPause;
        while(count < results.length) {

            consumeCPU(interval);

            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;

            results[count++] = latency;
            if (count == WARMUP) {
                interval = PAUSE;
            }
        }

        System.out.println("Results:" + Arrays.toString(Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
    }

    public static void main(String[] args) {
        int totalCount = 0;
        while (totalCount < 100) {
            new JvmPauseLatency().run(0);
            totalCount ++;
        }
    }
}

And the results are:

Results:[62, 66, 63, 64, 62, 62, 60, 58, 65, 61, 127, 245, 140, 85, 88, 114, 76, 199, 310, 196]
Results:[61, 63, 65, 64, 62, 65, 82, 63, 67, 70, 104, 176, 368, 297, 272, 183, 248, 217, 267, 181]
Results:[62, 65, 60, 59, 54, 64, 63, 71, 48, 59, 202, 74, 400, 247, 215, 184, 380, 258, 266, 323]

In order to fix this benchmark, just replace new JvmPauseLatency().run(0) with new JvmPauseLatency().run(PAUSE), and here are the results:

Results:[46, 45, 44, 45, 48, 46, 43, 72, 50, 47, 46, 44, 54, 45, 43, 43, 43, 48, 46, 43]
Results:[44, 44, 45, 45, 43, 46, 46, 44, 44, 44, 43, 49, 45, 44, 43, 49, 45, 46, 45, 44]

If you want to change the "pause" dynamically, you have to warm up the JVM dynamically too, i.e.

    while(count < results.length) {

        consumeCPU(interval);

        long latency = System.nanoTime();
        latency = System.nanoTime() - latency;

        results[count++] = latency;
        if (count >= WARMUP) {
            interval = PAUSE;
        } else {
            interval = rnd.nextBoolean() ? PAUSE : 0; // rnd is a java.util.Random instance
        }
    }

Problem N4. What about the interpreter (-Xint)?

In the case of a switch-based interpreter we have a lot of problems, and the main one is indirect branch instructions. I ran 3 experiments:

  1. random warmup
  2. constant warmup with 0 pause
  3. the whole test, including the measured part, uses pause 0

Each experiment was started with the command sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,branch-misses java -Xint JvmPauseLatency, and the results are:

 Performance counter stats for 'java -Xint JvmPauseLatency':

   272,822,274,275      cycles                                                      
   723,420,125,590      instructions              #    2.65  insn per cycle         
        26,994,494      cache-references                                            
         8,575,746      cache-misses              #   31.769 % of all cache refs    
     2,060,138,555      bus-cycles                                                  
         2,930,155      branch-misses                                               

      86.808481183 seconds time elapsed

 Performance counter stats for 'java -Xint JvmPauseLatency':

     2,812,949,238      cycles                                                      
     7,267,497,946      instructions              #    2.58  insn per cycle         
         6,936,666      cache-references                                            
         1,107,318      cache-misses              #   15.963 % of all cache refs    
        21,410,797      bus-cycles                                                  
           791,441      branch-misses                                               

       0.907758181 seconds time elapsed

 Performance counter stats for 'java -Xint JvmPauseLatency':

       126,157,793      cycles                                                      
       158,845,300      instructions              #    1.26  insn per cycle         
         6,650,471      cache-references                                            
           909,593      cache-misses              #   13.677 % of all cache refs    
         1,635,548      bus-cycles                                                  
           775,564      branch-misses                                               

       0.073511817 seconds time elapsed

When branch misses occur, latency grows non-linearly because of the interpreter's huge memory footprint.

You probably cannot rely on any timer having the precision you seem to want. https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime-- states that:

This method provides nanosecond precision, but not necessarily nanosecond resolution (that is, how frequently the value changes) - no guarantees are made except that the resolution is at least as good as that of currentTimeMillis().
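A quick way to see this on your own machine is to probe how finely nanoTime() actually ticks. This is a hypothetical sketch, not part of the original benchmark:

```java
// Probe the effective resolution of System.nanoTime(): take back-to-back
// readings and record the smallest step the clock is ever observed to make.
public class NanoTimeResolution {

    public static void main(String[] args) {
        long minStep = Long.MAX_VALUE;
        for (int i = 0; i < 100_000; i++) {
            long t0 = System.nanoTime();
            long t1 = System.nanoTime();
            while (t1 == t0) {        // spin until the reported value changes
                t1 = System.nanoTime();
            }
            minStep = Math.min(minStep, t1 - t0);
        }
        System.out.println("smallest observed nanoTime() step: " + minStep + " ns");
    }
}
```

On typical Linux/x86 setups this tends to print a few tens of nanoseconds, which is the same order of magnitude as the per-iteration latencies being measured above, so single measurements are dominated by timer noise.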
