Why does the JVM show more latency for the same block of code after a busy spin pause?

The code below demonstrates the problem unequivocally:

The exact same block of code becomes slower after a busy spin pause.

Note that of course I'm not using Thread.sleep. Also note that there are no conditionals leading to a HotSpot/JIT de-optimization, since I change the pause using a math operation, not an if.

  • There is a block of math operations that I want to time.
  • First I time the block, pausing 1 nanosecond before I start each measurement. I do that 20,000 times.
  • Then I change the pause from 1 nanosecond to 5 seconds and proceed to measure the latency as usual. I do that 15 times.
  • Then I print the last 30 measurements, so you can see 15 measurements with the 1-nanosecond pause and 15 measurements with the 5-second pause.

As you can see below, the discrepancy is big, especially in the very first measurement after the pause change. Why is that!?

$ java -server -cp . JvmPauseLatency
Sat Apr 29 10:34:28 EDT 2017 => Please wait 75 seconds for the results...
Sat Apr 29 10:35:43 EDT 2017 => Calculation: 4.0042328611017236E11
Results:
215
214
215
214
215
214
217
215
216
214
216
213
215
214
215
2343 <----- FIRST MEASUREMENT AFTER PAUSE CHANGE
795
727
942
778
765
856
762
801
708
692
765
776
780
754

The code:

import java.util.Arrays;
import java.util.Date;
import java.util.Random;

public class JvmPauseLatency {

    private static final int WARMUP = 20000;
    private static final int EXTRA = 15;
    private static final long PAUSE = 5 * 1000000000L; // in nanos

    private final Random rand = new Random();
    private int count;
    private double calculation;
    private final long[] results = new long[WARMUP + EXTRA];
    private long interval = 1; // in nanos

    private long busyPause(long pauseInNanos) {
        final long start = System.nanoTime();
        long until = Long.MAX_VALUE;
        while(System.nanoTime() < until) {
           until = start + pauseInNanos;
        }
        return until;
    }

    public void run() {

        long testDuration = ((WARMUP * 1) + (EXTRA * PAUSE)) / 1000000000L;
        System.out.println(new Date() +" => Please wait " + testDuration + " seconds for the results...");

        while(count < results.length) {

            double x = busyPause(interval);

            long latency = System.nanoTime();

            calculation += x / (rand.nextInt(5) + 1);
            calculation -= calculation / (rand.nextInt(5) + 1);
            calculation -= x / (rand.nextInt(6) + 1);
            calculation += calculation / (rand.nextInt(6) + 1);

            latency = System.nanoTime() - latency;

            results[count++] = latency;
            interval = (count / WARMUP * (PAUSE - 1)) + 1; // it will change to PAUSE when it reaches WARMUP
        }

        // now print the last (EXTRA * 2) results so you can compare before and after the pause change (from 1 to PAUSE)
        System.out.println(new Date() + " => Calculation: " + calculation);
        System.out.println("Results:");
        long[] array = Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length);
        for(long t: array) System.out.println(t);
    }

    public static void main(String[] args) {
        new JvmPauseLatency().run();
    }
}

TL;DR

http://www.brendangregg.com/activebenchmarking.html

casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.

Problem N1. The very first measurement after the pause change.

It looks like you are faced with on-stack replacement. When OSR occurs, the VM is paused, and the stack frame for the target function is replaced by an equivalent frame.

The root cause is a flawed microbenchmark: it was not properly warmed up. To fix it, just insert the following line into your benchmark before the while loop:

System.out.println("WARMUP = " + busyPause(5000000000L));

How to check this: just run your benchmark with -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls. I've modified your code so that it now prints interval to standard output before every call:

interval = 1
interval = 1
interval = 5000000000
    689  145       4       JvmPauseLatency::busyPause (19 bytes)   made not entrant
    689  146       3       JvmPauseLatency::busyPause (19 bytes)
Installing method (3) JvmPauseLatency.busyPause(J)J 
    698  147 %     4       JvmPauseLatency::busyPause @ 6 (19 bytes)
Installing osr method (4) JvmPauseLatency.busyPause(J)J @ 6
    702  148       4       JvmPauseLatency::busyPause (19 bytes)
    705  146       3       JvmPauseLatency::busyPause (19 bytes)   made not entrant
Installing method (4) JvmPauseLatency.busyPause(J)J 
interval = 5000000000
interval = 5000000000
interval = 5000000000
interval = 5000000000

Usually OSR occurs at tier 4, so in order to disable it you can use the following options:

  • -XX:-TieredCompilation disables tiered compilation entirely
  • -XX:+TieredCompilation -XX:TieredStopAtLevel=3 caps compilation at tier 3, so tier-4 code (and tier-4 OSR) is never generated
  • -XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement disables OSR
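As a sketch (assuming JvmPauseLatency.class is on the classpath, as in the run above), the three variants would be launched like this:

```shell
# 1) Disable tiered compilation entirely
java -XX:-TieredCompilation JvmPauseLatency

# 2) Cap compilation at tier 3, so tier-4 code (and tier-4 OSR) never happens
java -XX:+TieredCompilation -XX:TieredStopAtLevel=3 JvmPauseLatency

# 3) Keep tier 4 but disable on-stack replacement
java -XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement JvmPauseLatency
```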

Problem N2. How to measure.

Let's start from the article https://shipilev.net/blog/2014/nanotrusting-nanotime. In a few words:

  • the JIT compiles whole methods; in your test you have a single big loop, so only OSR is available to your test
  • you are trying to measure something small, possibly smaller than the cost of the nanoTime() call itself (see "What is the cost of volatile write?")
  • the microarchitecture level matters: caches and CPU pipeline stalls are important; for example, a TLB miss or a branch misprediction can take more time than the test execution itself

So in order to avoid all these pitfalls you can use a JMH-based benchmark like this:

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2)
public class LatencyTest {

    public static final long LONG_PAUSE = 5000L;
    public static final long SHORT_PAUSE = 1L;
    public Random rand;

    @Setup
    public void initI() {
        rand = new Random(0xDEAD_BEEF);
    }

    private long busyPause(long pauseInNanos) {
        Blackhole.consumeCPU(pauseInNanos);
        return pauseInNanos;
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseShort() {
        return busyPause(SHORT_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseLong() {
        return busyPause(LONG_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyFunc() {
        return doCalculation(1);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureShort() {
        long x = busyPause(SHORT_PAUSE);
        return doCalculation(x);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureLong() {
        long x = busyPause(LONG_PAUSE);
        return doCalculation(x);
    }

    private long doCalculation(long x) {
        long calculation = 0;
        calculation += x / (rand.nextInt(5) + 1);
        calculation -= calculation / (rand.nextInt(5) + 1);
        calculation -= x / (rand.nextInt(6) + 1);
        calculation += calculation / (rand.nextInt(6) + 1);
        return calculation;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LatencyTest.class.getName())
                .verbosity(VerboseMode.NORMAL)
                .build();
        new Runner(options).run();
    }
}

Please note that I've changed the busy-loop implementation to Blackhole#consumeCPU() in order to avoid OS-related effects. So my results are:

Benchmark                          Mode  Cnt      Score     Error  Units
LatencyTest.latencyBusyPauseLong   avgt    4  15992.216 ± 106.538  ns/op
LatencyTest.latencyBusyPauseShort  avgt    4      6.450 ±   0.163  ns/op
LatencyTest.latencyFunc            avgt    4     97.321 ±   0.984  ns/op
LatencyTest.measureLong            avgt    4  16103.228 ± 102.338  ns/op
LatencyTest.measureShort           avgt    4    100.454 ±   0.041  ns/op

Please note that the results are almost additive, i.e. latencyFunc + latencyBusyPauseShort ≈ measureShort (97.321 + 6.450 = 103.771 ns/op, versus the measured 100.454 ns/op).

Problem N3. The discrepancy is big.

What is wrong with your test? It does not warm up the JVM properly: it uses one parameter for warm-up and a different one for the test. Why is this important? The JVM uses profile-guided optimizations; for example, it counts how often a branch has been taken and generates "the best" (branch-free) code for that particular profile. So while we are warming up the JVM with parameter 1, it generates "optimal" code in which the branch in the while loop has never been taken. Here is an event from the JIT compilation log (-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation):

<branch prob="0.0408393" not_taken="40960" taken="1744" cnt="42704" target_bci="42"/> 
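The effect is easy to reproduce in isolation. Below is a minimal, hypothetical sketch (not the benchmark above): during warmup one side of a branch is never taken, so the JIT is free to compile it down to an uncommon trap; the first time the rare side is actually taken, the compiled code is deoptimized (visible as "made not entrant" when run with -XX:+PrintCompilation).

```java
// Sketch of profile pollution: the condition is always false during
// warmup, so the JIT may compile the taken side as an uncommon trap;
// the first true value then deoptimizes the compiled method.
public class ProfilePollution {

    private static long counter = 0;

    private static void work(boolean rare) {
        if (rare) {            // never taken during warmup
            counter += 2;
        } else {
            counter += 1;
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1_000_000; i++) work(false); // train the profile
        for (int i = 0; i < 10; i++) work(true);         // hit the rare branch
        System.out.println(counter);                     // prints 1000020
    }
}
```

Run it with -XX:+PrintCompilation to watch the recompilation; the arithmetic result itself is of course deterministic.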

After the parameter change, the JIT falls back to an uncommon trap to handle your code, which is not optimal. I've created a benchmark based on the original one, with minor changes:

  • busyPause is replaced by consumeCPU from JMH in order to have a pure Java benchmark with no interaction with the system (nanoTime actually goes through the userland vDSO function clock_gettime, and we are unable to profile that code)
  • all calculations are removed


import java.util.Arrays;

public class JvmPauseLatency {

    private static final int WARMUP = 2000 ;
    private static final int EXTRA = 10;
    private static final long PAUSE = 70000L; // in nanos
    private static volatile long consumedCPU = System.nanoTime();

    //org.openjdk.jmh.infra.Blackhole.consumeCPU()
    private static void consumeCPU(long tokens) {
        long t = consumedCPU;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        if (t == 42) {
            consumedCPU += t;
        }
    }

    public void run(long warmPause) {
        long[] results = new long[WARMUP + EXTRA];
        int count = 0;
        long interval = warmPause;
        while(count < results.length) {

            consumeCPU(interval);

            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;

            results[count++] = latency;
            if (count == WARMUP) {
                interval = PAUSE;
            }
        }

        System.out.println("Results:" + Arrays.toString(Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
    }

    public static void main(String[] args) {
        int totalCount = 0;
        while (totalCount < 100) {
            new JvmPauseLatency().run(0);
            totalCount ++;
        }
    }
}

And the results are:

Results:[62, 66, 63, 64, 62, 62, 60, 58, 65, 61, 127, 245, 140, 85, 88, 114, 76, 199, 310, 196]
Results:[61, 63, 65, 64, 62, 65, 82, 63, 67, 70, 104, 176, 368, 297, 272, 183, 248, 217, 267, 181]
Results:[62, 65, 60, 59, 54, 64, 63, 71, 48, 59, 202, 74, 400, 247, 215, 184, 380, 258, 266, 323]

In order to fix this benchmark, just replace new JvmPauseLatency().run(0) with new JvmPauseLatency().run(PAUSE), and here are the results:

Results:[46, 45, 44, 45, 48, 46, 43, 72, 50, 47, 46, 44, 54, 45, 43, 43, 43, 48, 46, 43]
Results:[44, 44, 45, 45, 43, 46, 46, 44, 44, 44, 43, 49, 45, 44, 43, 49, 45, 46, 45, 44]

If you want to change the "pause" dynamically, you have to warm up the JVM dynamically too, i.e.

    while(count < results.length) {

        consumeCPU(interval);

        long latency = System.nanoTime();
        latency = System.nanoTime() - latency;

        results[count++] = latency;
        if (count >= WARMUP) {
            interval = PAUSE;
        } else {
            interval = rnd.nextBoolean() ? PAUSE : 0; // rnd is a java.util.Random instance
        }
    }

Problem N4. What about the interpreter (-Xint)?

In the case of a switch-based interpreter we have a lot of problems, and the main one is indirect branch instructions. I ran 3 experiments:

  1. random warmup
  2. constant warmup with 0 pause
  3. the whole test, including the measured part, uses pause 0

Each experiment was started with the command sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,branch-misses java -Xint JvmPauseLatency, and the results are:

 Performance counter stats for 'java -Xint JvmPauseLatency':

   272,822,274,275      cycles                                                      
   723,420,125,590      instructions              #    2.65  insn per cycle         
        26,994,494      cache-references                                            
         8,575,746      cache-misses              #   31.769 % of all cache refs    
     2,060,138,555      bus-cycles                                                  
         2,930,155      branch-misses                                               

      86.808481183 seconds time elapsed

 Performance counter stats for 'java -Xint JvmPauseLatency':

     2,812,949,238      cycles                                                      
     7,267,497,946      instructions              #    2.58  insn per cycle         
         6,936,666      cache-references                                            
         1,107,318      cache-misses              #   15.963 % of all cache refs    
        21,410,797      bus-cycles                                                  
           791,441      branch-misses                                               

       0.907758181 seconds time elapsed

 Performance counter stats for 'java -Xint JvmPauseLatency':

       126,157,793      cycles                                                      
       158,845,300      instructions              #    1.26  insn per cycle         
         6,650,471      cache-references                                            
           909,593      cache-misses              #   13.677 % of all cache refs    
         1,635,548      bus-cycles                                                  
           775,564      branch-misses                                               

       0.073511817 seconds time elapsed

When branch misses occur, latency grows non-linearly because of the interpreter's huge memory footprint.

You probably cannot rely on any timer having the precision you seem to want. https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime-- states that:

This method provides nanosecond precision, but not necessarily nanosecond resolution (that is, how frequently the value changes) - no guarantees are made except that the resolution is at least as good as that of currentTimeMillis().
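A quick way to see this on your own machine is to probe how finely nanoTime() actually ticks. This is a hypothetical sketch, not part of the original benchmark:

```java
// Probe the effective resolution of System.nanoTime(): take back-to-back
// readings and record the smallest step the clock is ever observed to make.
public class NanoTimeResolution {

    public static void main(String[] args) {
        long minStep = Long.MAX_VALUE;
        for (int i = 0; i < 100_000; i++) {
            long t0 = System.nanoTime();
            long t1 = System.nanoTime();
            while (t1 == t0) {        // spin until the reported value changes
                t1 = System.nanoTime();
            }
            minStep = Math.min(minStep, t1 - t0);
        }
        System.out.println("smallest observed nanoTime() step: " + minStep + " ns");
    }
}
```

On typical Linux/x86 setups this tends to print a few tens of nanoseconds, which is the same order of magnitude as the per-iteration latencies being measured above, so single measurements are dominated by timer noise.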
