简体   繁体   中英

Why does running multiple lambdas in loops suddenly slow down?

Consider the following code:

public class Playground {

    private static final int MAX = 100_000_000;

    public static void main(String... args) {
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
    }

    public static void execute(Runnable task) {
        Stopwatch stopwatch = Stopwatch.createStarted();
        for (int i = 0; i < MAX; i++) {
            task.run();
        }
        System.out.println(stopwatch);
    }

}

This currently prints the following on my Intel MBP on Temurin 17:

3.675 ms
1.948 ms
216.9 ms
243.3 ms

Notice the 100* slowdown for the third (and any subsequent) execution. Now, obviously, this is NOT how to write benchmarks in Java . The loop code doesn't do anything, so I'd expect it to be eliminated for all and any repetitions. Also I could not repeat this effect using JMH which tells me the reason is tricky and fragile.

So, why does this happen? Why would there suddenly be such a catastrophic slowdown, what's going on under the hood? An assumption is that C2 bails on us, but which limitation are we bumping into?

Things that don't change the behavior:

  • using anonymous inner classes instead of lambdas,
  • using 3+ different nested classes instead of lambdas.

Things that "fix" the behavior. Actually the third invocation and all subsequent appear to be much faster, hinting that compilation correctly eliminated the loops completely:

  • using 1-2 nested classes instead of lambdas,
  • using 1-2 lambda instances instead of 4 different ones,
  • not calling task.run() lambdas inside the loop,
  • inlining the execute() method, still maintaining 4 different lambdas.

You can actually replicate this with JMH SingleShot mode:

@BenchmarkMode(Mode.SingleShotTime)
@Warmup(iterations = 0)
@Measurement(iterations = 1)
@Fork(1)
public class Lambdas {

    @Benchmark
    public static void doOne() {
        execute(() -> {});
    }

    @Benchmark
    public static void doFour() {
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
    }

    public static void execute(Runnable task) {
        for (int i = 0; i < 100_000_000; i++) {
            task.run();
        }
    }
}
Benchmark            Mode  Cnt  Score   Error  Units
Lambdas.doFour         ss       0.446           s/op
Lambdas.doOne          ss       0.006           s/op

If you look at -prof perfasm profile for doFour test, you would get a fat clue:

....[Hottest Methods (after inlining)]..............................................................
 32.19%         c2, level 4  org.openjdk.Lambdas$$Lambda$44.0x0000000800c258b8::run, version 664 
 26.16%         c2, level 4  org.openjdk.Lambdas$$Lambda$43.0x0000000800c25698::run, version 658 

There are at least two hot lambdas, and those are represented by different classes. So what you are seeing is likely monomorphic (one target), then bimorphic (two targets), then polymorphic virtual call at task.run .

Virtual call has to choose which class to call the implementation from. The more classes you have, the worse it gets for optimizer. JVM tries to adapt, but it gets worse and worse as the run progresses. Roughly like this:

execute(() -> {}); // compiles with single target, fast
execute(() -> {}); // recompiles with two targets, a bit slower
execute(() -> {}); // recompiles with three targets, slow
execute(() -> {}); // continues to be slow

Now, the elimination of the loop requires seeing through the task.run() . In monomorphic and bimorphic cases it is easy: one or both targets are inlined, their empty body is discovered, done. In both cases, you would have to do typechecks, which means it is not completely free, with bimorphic costing a bit extra. When you experience a polymorphic call, there is no such luck at all: it is opaque call.

You can add two more benchmarks in the mix to see it:

    @Benchmark
    public static void doFour_Same() {
        Runnable l = () -> {};
        execute(l);
        execute(l);
        execute(l);
        execute(l);
    }

    @Benchmark
    public static void doFour_Pair() {
        Runnable l1 = () -> {};
        Runnable l2 = () -> {};
        execute(l1);
        execute(l1);
        execute(l2);
        execute(l2);
    }

Which would then yield:

Benchmark            Mode  Cnt  Score   Error  Units
Lambdas.doFour         ss       0.445           s/op ; polymorphic
Lambdas.doFour_Pair    ss       0.016           s/op ; bimorphic
Lambdas.doFour_Same    ss       0.008           s/op ; monomorphic
Lambdas.doOne          ss       0.006           s/op

This also explains why your "fixes" help:

using 1-2 nested classes instead of lambdas,

Bimorphic inlining.

 using 1-2 lambda instances instead of 4 different ones,

Bimorphic inlining.

 not calling task.run() lambdas inside the loop,

Avoids polymorphic (opaque) call in the loop, allows loop elimination.

 inlining the execute() method, still maintaining 4 different lambdas.

Avoids a single call site that experiences multiple call targets. In other words, turns a single polymorphic call site into a series of monomorphic call sites each with its own target.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM