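Why does the JVM show more latency for the same block of code after a busy spin pause?
The code below demonstrates the problem unequivocally, namely:
The exact same block of code becomes slower after a busy spin pause.
Note that I am of course not using Thread.sleep. Also note that there are no conditionals that could lead to a HotSpot/JIT de-optimization, because I change the pause with a math operation instead of an IF.
As you can see below, the difference is big, especially in the very first measurement after the pause change. Why is that!?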
$ java -server -cp . JvmPauseLatency
Sat Apr 29 10:34:28 EDT 2017 => Please wait 75 seconds for the results...
Sat Apr 29 10:35:43 EDT 2017 => Calculation: 4.0042328611017236E11
Results:
215
214
215
214
215
214
217
215
216
214
216
213
215
214
215
2343 <----- FIRST MEASUREMENT AFTER PAUSE CHANGE
795
727
942
778
765
856
762
801
708
692
765
776
780
754
Code:
import java.util.Arrays;
import java.util.Date;
import java.util.Random;

public class JvmPauseLatency {

    private static final int WARMUP = 20000;
    private static final int EXTRA = 15;
    private static final long PAUSE = 5 * 1000000000L; // in nanos

    private final Random rand = new Random();
    private int count;
    private double calculation;
    private final long[] results = new long[WARMUP + EXTRA];
    private long interval = 1; // in nanos

    private long busyPause(long pauseInNanos) {
        final long start = System.nanoTime();
        long until = Long.MAX_VALUE;
        while (System.nanoTime() < until) {
            until = start + pauseInNanos;
        }
        return until;
    }

    public void run() {
        long testDuration = ((WARMUP * 1) + (EXTRA * PAUSE)) / 1000000000L;
        System.out.println(new Date() + " => Please wait " + testDuration + " seconds for the results...");

        while (count < results.length) {
            double x = busyPause(interval);

            long latency = System.nanoTime();
            calculation += x / (rand.nextInt(5) + 1);
            calculation -= calculation / (rand.nextInt(5) + 1);
            calculation -= x / (rand.nextInt(6) + 1);
            calculation += calculation / (rand.nextInt(6) + 1);
            latency = System.nanoTime() - latency;

            results[count++] = latency;
            interval = (count / WARMUP * (PAUSE - 1)) + 1; // it will change to PAUSE when it reaches WARMUP
        }

        // now print the last (EXTRA * 2) results so you can compare before and after the pause change (from 1 to PAUSE)
        System.out.println(new Date() + " => Calculation: " + calculation);
        System.out.println("Results:");
        long[] array = Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length);
        for (long t : array) System.out.println(t);
    }

    public static void main(String[] args) {
        new JvmPauseLatency().run();
    }
}
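The branch-free switch of the pause in the code above relies on integer division: count / WARMUP is 0 while count is below WARMUP and becomes 1 once it reaches WARMUP. A minimal sketch that isolates that one expression, assuming the same constants as the benchmark (the class name IntervalSwitch and the helper intervalFor are illustrative only, not part of the original code):

```java
public class IntervalSwitch {

    static final int WARMUP = 20000;
    static final long PAUSE = 5 * 1000000000L; // in nanos

    // Same expression as in the benchmark loop. count / WARMUP is an int
    // division, so it yields 0 during warmup and 1 for the EXTRA iterations.
    static long intervalFor(int count) {
        return (count / WARMUP * (PAUSE - 1)) + 1;
    }

    public static void main(String[] args) {
        System.out.println(intervalFor(0));          // warmup: 1 ns
        System.out.println(intervalFor(WARMUP - 1)); // still warmup: 1 ns
        System.out.println(intervalFor(WARMUP));     // switches to PAUSE
        System.out.println(intervalFor(WARMUP + 14));
    }
}
```

So although no IF appears in the source, the argument passed to busyPause still jumps from 1 ns to 5 s at a single point in the run.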
http://www.brendangregg.com/activebenchmarking.html
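Casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C.
It looks like you are facing on-stack replacement (OSR). When OSR occurs, the VM is paused and the stack frame of the target function is replaced by an equivalent frame.
The root cause is a wrong microbenchmark: it was not properly warmed up. Just insert the following line into your benchmark before the while loop to fix it: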
System.out.println("WARMUP = " + busyPause(5000000000L));
How to check this: just run your benchmark with -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls. I've modified your code so that it now prints the interval to system output before every call:
interval = 1
interval = 1
interval = 5000000000
689 145 4 JvmPauseLatency::busyPause (19 bytes) made not entrant
689 146 3 JvmPauseLatency::busyPause (19 bytes)
Installing method (3) JvmPauseLatency.busyPause(J)J
698 147 % 4 JvmPauseLatency::busyPause @ 6 (19 bytes)
Installing osr method (4) JvmPauseLatency.busyPause(J)J @ 6
702 148 4 JvmPauseLatency::busyPause (19 bytes)
705 146 3 JvmPauseLatency::busyPause (19 bytes) made not entrant
Installing method (4) JvmPauseLatency.busyPause(J)J
interval = 5000000000
interval = 5000000000
interval = 5000000000
interval = 5000000000
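For context, here is a minimal sketch of the kind of code that can only be optimized through OSR: a method that is entered once and then stays inside one long-running loop, much like busyPause with a 5-second pause. The names OsrDemo and spin are illustrative and not from the original post. When run on HotSpot with -XX:+PrintCompilation, a line marked with % (as in the log above) would be expected for spin, although the exact output depends on the JVM version and flags.

```java
public class OsrDemo {

    // Entered only once from main; the loop gets hot while the method is
    // still running, so the JIT has to swap the frame in place (OSR)
    // rather than wait for the next invocation.
    static long spin(int iterations) {
        long acc = 0;
        for (int i = 0; i < iterations; i++) {
            acc += i;
        }
        return acc;
    }

    public static void main(String[] args) {
        // Print the result so the loop cannot be discarded as dead code.
        System.out.println(spin(200_000_000));
    }
}
```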
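Usually OSR occurs at tier 4, so to disable it you can use the following options:
To disable tiered compilation:
-XX:-TieredCompilation
To disable tiered compilation to level 4:
-XX:-TieredCompilation -XX:TieredStopAtLevel=3
To disable OSR:
-XX:+TieredCompilation -XX:TieredStopAtLevel=4 -XX:-UseOnStackReplacement
Let's start with the article https://shipilev.net/blog/2014/nanotrusting-nanotime. In short: mind the nanoTime() calls themselves (see "What is the cost of a volatile write?").
So, to avoid all these pitfalls, you can use a JMH-based benchmark like this: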
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;
import org.openjdk.jmh.runner.options.VerboseMode;

import java.util.Random;
import java.util.concurrent.TimeUnit;

@State(Scope.Benchmark)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 2, time = 3, timeUnit = TimeUnit.SECONDS)
@Fork(value = 2)
public class LatencyTest {

    public static final long LONG_PAUSE = 5000L;
    public static final long SHORT_PAUSE = 1L;
    public Random rand;

    @Setup
    public void initI() {
        rand = new Random(0xDEAD_BEEF);
    }

    private long busyPause(long pauseInNanos) {
        Blackhole.consumeCPU(pauseInNanos);
        return pauseInNanos;
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseShort() {
        return busyPause(SHORT_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyBusyPauseLong() {
        return busyPause(LONG_PAUSE);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long latencyFunc() {
        return doCalculation(1);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureShort() {
        long x = busyPause(SHORT_PAUSE);
        return doCalculation(x);
    }

    @Benchmark
    @BenchmarkMode({Mode.AverageTime})
    public long measureLong() {
        long x = busyPause(LONG_PAUSE);
        return doCalculation(x);
    }

    private long doCalculation(long x) {
        long calculation = 0;
        calculation += x / (rand.nextInt(5) + 1);
        calculation -= calculation / (rand.nextInt(5) + 1);
        calculation -= x / (rand.nextInt(6) + 1);
        calculation += calculation / (rand.nextInt(6) + 1);
        return calculation;
    }

    public static void main(String[] args) throws RunnerException {
        Options options = new OptionsBuilder()
                .include(LatencyTest.class.getName())
                .verbosity(VerboseMode.NORMAL)
                .build();
        new Runner(options).run();
    }
}
Note that I've changed the busy-loop implementation to Blackhole#consumeCPU() to avoid OS-related effects. My results are:
Benchmark                          Mode  Cnt      Score     Error  Units
LatencyTest.latencyBusyPauseLong   avgt    4  15992.216 ± 106.538  ns/op
LatencyTest.latencyBusyPauseShort  avgt    4      6.450 ±   0.163  ns/op
LatencyTest.latencyFunc            avgt    4     97.321 ±   0.984  ns/op
LatencyTest.measureLong            avgt    4  16103.228 ± 102.338  ns/op
LatencyTest.measureShort           avgt    4    100.454 ±   0.041  ns/op
Note that the results are almost additive, i.e. latencyFunc + latencyBusyPauseShort ≈ measureShort.
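As a quick arithmetic check of "almost additive", the sketch below plugs in the scores from the table above. The class AdditivityCheck and the helper relativeGap are hypothetical names introduced only for this illustration.

```java
public class AdditivityCheck {

    // Relative difference between the sum of two separately measured parts
    // and the measurement of both parts executed together.
    static double relativeGap(double partA, double partB, double combined) {
        return Math.abs((partA + partB) - combined) / combined;
    }

    public static void main(String[] args) {
        // Scores in ns/op, copied from the JMH results.
        double latencyFunc = 97.321;
        double busyShort = 6.450;
        double busyLong = 15992.216;
        double measureShort = 100.454;
        double measureLong = 16103.228;

        System.out.printf("short: %.4f%n", relativeGap(latencyFunc, busyShort, measureShort));
        System.out.printf("long:  %.4f%n", relativeGap(latencyFunc, busyLong, measureLong));
    }
}
```

With these numbers the gap is roughly 3% for the short pause and below 0.1% for the long one, so the calculation costs about the same regardless of the pause that precedes it.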
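What is wrong with your test? It does not warm up the JVM properly: it uses one parameter for the warmup and another one for the test. Why does this matter? The JVM uses profile-guided optimizations. For example, it counts how often a branch is taken and generates the "best" (branch-free) code for that particular profile. So when we warm up the JVM for our benchmark with parameter 1, the JVM generates "optimal code" in which the branch in the while loop is never taken. Here is an event from the JIT compilation log (-XX:+UnlockDiagnosticVMOptions -XX:+LogCompilation):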
<branch prob="0.0408393" not_taken="40960" taken="1744" cnt="42704" target_bci="42"/>
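The attributes of that log record are consistent with prob being taken / cnt, where cnt = taken + not_taken. A small sketch under that assumption (the class BranchProfile and the method takenProbability are illustrative names, not part of any JVM API):

```java
public class BranchProfile {

    // Assumed relationship: prob = taken / (taken + not_taken).
    static double takenProbability(long taken, long notTaken) {
        return (double) taken / (taken + notTaken);
    }

    public static void main(String[] args) {
        // Counts copied from the <branch .../> record above.
        long taken = 1744;
        long notTaken = 40960;
        System.out.println("cnt  = " + (taken + notTaken));
        System.out.println("prob = " + takenProbability(taken, notTaken));
    }
}
```

The computed values agree with cnt="42704" and prob="0.0408393" in the record: during the profiled warmup the branch was taken only about 4% of the time.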
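After the parameter changes, the JIT uses an uncommon trap to handle your code, which is not optimal. I've created a benchmark based on the original one, with only minor changes (nanoTime goes through the vdso clock_gettime, and we cannot profile that code):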
import java.util.Arrays;

public class JvmPauseLatency {

    private static final int WARMUP = 2000;
    private static final int EXTRA = 10;
    private static final long PAUSE = 70000L; // in nanos
    private static volatile long consumedCPU = System.nanoTime();

    //org.openjdk.jmh.infra.Blackhole.consumeCPU()
    private static void consumeCPU(long tokens) {
        long t = consumedCPU;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        if (t == 42) {
            consumedCPU += t;
        }
    }

    public void run(long warmPause) {
        long[] results = new long[WARMUP + EXTRA];
        int count = 0;
        long interval = warmPause;
        while (count < results.length) {
            consumeCPU(interval);

            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;

            results[count++] = latency;
            if (count == WARMUP) {
                interval = PAUSE;
            }
        }
        System.out.println("Results:" + Arrays.toString(Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
    }

    public static void main(String[] args) {
        int totalCount = 0;
        while (totalCount < 100) {
            new JvmPauseLatency().run(0);
            totalCount++;
        }
    }
}
The results are:
Results:[62, 66, 63, 64, 62, 62, 60, 58, 65, 61, 127, 245, 140, 85, 88, 114, 76, 199, 310, 196]
Results:[61, 63, 65, 64, 62, 65, 82, 63, 67, 70, 104, 176, 368, 297, 272, 183, 248, 217, 267, 181]
Results:[62, 65, 60, 59, 54, 64, 63, 71, 48, 59, 202, 74, 400, 247, 215, 184, 380, 258, 266, 323]
To fix this benchmark, simply replace new JvmPauseLatency().run(0) with
new JvmPauseLatency().run(PAUSE);
Here are the results:
Results:[46, 45, 44, 45, 48, 46, 43, 72, 50, 47, 46, 44, 54, 45, 43, 43, 43, 48, 46, 43]
Results:[44, 44, 45, 45, 43, 46, 46, 44, 44, 44, 43, 49, 45, 44, 43, 49, 45, 46, 45, 44]
If you want to change the "pause" dynamically, you have to warm up the JVM dynamically as well, i.e.
while (count < results.length) {
    consumeCPU(interval);

    long latency = System.nanoTime();
    latency = System.nanoTime() - latency;

    results[count++] = latency;
    if (count >= WARMUP) {
        interval = PAUSE;
    } else {
        interval = rnd.nextBoolean() ? PAUSE : 0;
    }
}
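The fragment above leaves rnd undeclared. A self-contained sketch of the same idea follows, assuming rnd is a java.util.Random and reusing the constants and the consumeCPU copy from the benchmark above. The class name MixedWarmup and the seed parameter are additions made for this illustration.

```java
import java.util.Arrays;
import java.util.Random;

public class MixedWarmup {

    static final int WARMUP = 2000;
    static final int EXTRA = 10;
    static final long PAUSE = 70000L; // consumeCPU tokens
    private static volatile long consumedCPU = System.nanoTime();

    // Same busy-work loop as in the benchmark above.
    static void consumeCPU(long tokens) {
        long t = consumedCPU;
        for (long i = tokens; i > 0; i--) {
            t += (t * 0x5DEECE66DL + 0xBL + i) & (0xFFFFFFFFFFFFL);
        }
        if (t == 42) {
            consumedCPU += t;
        }
    }

    // During warmup the pause is randomly PAUSE or 0, so the JIT profile
    // sees both cases; after warmup it is always PAUSE.
    static long[] run(long seed) {
        Random rnd = new Random(seed); // assumed type of rnd in the fragment
        long[] results = new long[WARMUP + EXTRA];
        int count = 0;
        long interval = 0;
        while (count < results.length) {
            consumeCPU(interval);

            long latency = System.nanoTime();
            latency = System.nanoTime() - latency;

            results[count++] = latency;
            if (count >= WARMUP) {
                interval = PAUSE;
            } else {
                interval = rnd.nextBoolean() ? PAUSE : 0;
            }
        }
        return results;
    }

    public static void main(String[] args) {
        long[] results = run(42L);
        System.out.println("Results:" + Arrays.toString(
                Arrays.copyOfRange(results, results.length - EXTRA * 2, results.length)));
    }
}
```

The measured values depend on the machine and JVM, so no expected output is given here.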
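With a switch-based interpreter we have a lot of problems, the main one being indirect branch instructions. I ran 3 experiments:
Each experiment was started with the command sudo perf stat -e cycles,instructions,cache-references,cache-misses,bus-cycles,branch-misses java -Xint JvmPauseLatency, and the results are: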
Performance counter stats for 'java -Xint JvmPauseLatency':
272,822,274,275 cycles
723,420,125,590 instructions # 2.65 insn per cycle
26,994,494 cache-references
8,575,746 cache-misses # 31.769 % of all cache refs
2,060,138,555 bus-cycles
2,930,155 branch-misses
86.808481183 seconds time elapsed
Performance counter stats for 'java -Xint JvmPauseLatency':
2,812,949,238 cycles
7,267,497,946 instructions # 2.58 insn per cycle
6,936,666 cache-references
1,107,318 cache-misses # 15.963 % of all cache refs
21,410,797 bus-cycles
791,441 branch-misses
0.907758181 seconds time elapsed
Performance counter stats for 'java -Xint JvmPauseLatency':
126,157,793 cycles
158,845,300 instructions # 1.26 insn per cycle
6,650,471 cache-references
909,593 cache-misses # 13.677 % of all cache refs
1,635,548 bus-cycles
775,564 branch-misses
0.073511817 seconds time elapsed
With branch misses, latency and footprint grow non-linearly because of the huge memory footprint.
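You probably cannot rely on the precision of any timer for the accuracy you seem to want. https://docs.oracle.com/javase/8/docs/api/java/lang/System.html#nanoTime-- states:
This method provides nanosecond precision, but not necessarily nanosecond resolution (that is, how frequently the value changes) - no guarantees are made except that the resolution is at least as good as that of currentTimeMillis().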
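One way to get a feel for the resolution on a given machine is to spin until the value returned by System.nanoTime() changes and record the smallest step seen. This is only a rough sketch (the class NanoTimeGranularity and the method smallestStep are illustrative names), and the observed step also includes the cost of the call itself.

```java
public class NanoTimeGranularity {

    // Smallest observed difference between two distinct consecutive
    // nanoTime() readings over the given number of samples.
    static long smallestStep(int samples) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < samples; i++) {
            long t0 = System.nanoTime();
            long t1;
            do {
                t1 = System.nanoTime();
            } while (t1 == t0); // wait until the clock value actually changes
            min = Math.min(min, t1 - t0);
        }
        return min;
    }

    public static void main(String[] args) {
        System.out.println("smallest observed step: " + smallestStep(100_000) + " ns");
    }
}
```

If the step printed here is of the same order as the latencies being measured, single nanoTime() deltas of that size say little on their own.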