Why does running multiple lambdas in loops suddenly slow down?

Consider the following code:

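import com.google.common.base.Stopwatch; // assumed to be Guava's Stopwatch, judging by createStarted() and the output format

public class Playground {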

    private static final int MAX = 100_000_000;

    public static void main(String... args) {
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
    }

    public static void execute(Runnable task) {
        Stopwatch stopwatch = Stopwatch.createStarted();
        for (int i = 0; i < MAX; i++) {
            task.run();
        }
        System.out.println(stopwatch);
    }

}

This currently prints the following on my Intel MBP on Temurin 17:

3.675 ms
1.948 ms
216.9 ms
243.3 ms

Notice the roughly 100x slowdown for the third (and any subsequent) execution. Now, obviously, this is NOT how to write benchmarks in Java. The loop code doesn't do anything, so I'd expect it to be eliminated for each and every repetition. Also, I could not reproduce this effect using JMH, which tells me the cause is tricky and fragile.

So, why does this happen? Why would there suddenly be such a catastrophic slowdown, and what's going on under the hood? My assumption is that C2 bails out on us, but which limitation are we bumping into?

Things that don't change the behavior:

  • using anonymous inner classes instead of lambdas,
  • using 3+ different nested classes instead of lambdas.

Things that "fix" the behavior, in the sense that the third and all subsequent invocations become much faster, hinting that compilation eliminated the loops completely:

  • using 1-2 nested classes instead of lambdas,
  • using 1-2 lambda instances instead of 4 different ones,
  • not calling task.run() lambdas inside the loop,
  • inlining the execute() method, still maintaining 4 different lambdas.

You can actually replicate this with JMH SingleShot mode:

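import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Warmup;

@BenchmarkMode(Mode.SingleShotTime)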
@Warmup(iterations = 0)
@Measurement(iterations = 1)
@Fork(1)
public class Lambdas {

    @Benchmark
    public static void doOne() {
        execute(() -> {});
    }

    @Benchmark
    public static void doFour() {
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
    }

    public static void execute(Runnable task) {
        for (int i = 0; i < 100_000_000; i++) {
            task.run();
        }
    }
}

Benchmark            Mode  Cnt  Score   Error  Units
Lambdas.doFour         ss       0.446           s/op
Lambdas.doOne          ss       0.006           s/op

If you look at the -prof perfasm profile for the doFour test, you get a fat clue:

....[Hottest Methods (after inlining)]..............................................................
 32.19%         c2, level 4  org.openjdk.Lambdas$$Lambda$44.0x0000000800c258b8::run, version 664 
 26.16%         c2, level 4  org.openjdk.Lambdas$$Lambda$43.0x0000000800c25698::run, version 658 

There are at least two hot lambdas, and those are represented by different classes. So what you are seeing is likely a monomorphic (one target), then bimorphic (two targets), then polymorphic virtual call at task.run().
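To check the "different classes" part yourself, here is a minimal sketch. The class name LambdaClasses is made up for illustration, and the result relies on how current HotSpot builds link lambda expressions (one generated class per lambda call site), which is an implementation detail rather than something the language specification promises:

```java
// Illustrative only: LambdaClasses is a hypothetical name, not part of the benchmark.
final class LambdaClasses {

    // Two textually identical lambda expressions at two different source locations.
    static Runnable first()  { return () -> {}; }
    static Runnable second() { return () -> {}; }

    // True when the two expressions are backed by different runtime classes.
    static boolean backedByDifferentClasses() {
        return first().getClass() != second().getClass();
    }

    public static void main(String... args) {
        System.out.println(first().getClass().getName());
        System.out.println(second().getClass().getName());
        System.out.println("different classes: " + backedByDifferentClasses());
    }
}
```

From the call site's point of view, each `() -> {}` in doFour is therefore a new receiver type, even though the source text is identical.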

A virtual call has to choose which class to call the implementation from. The more classes you have, the worse it gets for the optimizer. The JVM tries to adapt, but it gets worse and worse as the run progresses. Roughly like this:

execute(() -> {}); // compiles with single target, fast
execute(() -> {}); // recompiles with two targets, a bit slower
execute(() -> {}); // recompiles with three targets, slow
execute(() -> {}); // continues to be slow

Now, eliminating the loop requires seeing through task.run(). In the monomorphic and bimorphic cases this is easy: one or both targets are inlined, their empty bodies are discovered, done. In both cases the compiled code still has to do type checks, so it is not completely free, with bimorphic costing a bit extra. When you hit a polymorphic call, there is no such luck at all: it is an opaque call.
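As a rough mental model, the guarded inlining described above can be written out by hand. This is only an analogue in plain Java, not actual C2 output: the names DispatchSketch, A, B, C and guardedLoop are invented for illustration, and the fallback branch merely stands in for whatever the JIT really does on a miss (which may include deoptimizing and recompiling).

```java
// Hand-written analogue of guarded inlining; NOT actual C2 output.
// DispatchSketch, A, B, C and guardedLoop are hypothetical names.
final class DispatchSketch {

    static final class A implements Runnable { @Override public void run() {} }
    static final class B implements Runnable { @Override public void run() {} }
    static final class C implements Runnable { @Override public void run() {} }

    // Mimics a call site whose profile recorded only A and B as receivers.
    // Returns how many iterations missed both guards.
    static int guardedLoop(Runnable task, int iterations) {
        int misses = 0;
        for (int i = 0; i < iterations; i++) {
            Class<?> type = task.getClass();   // the type check
            if (type == A.class) {
                // body of A.run() inlined here: nothing to do
            } else if (type == B.class) {
                // body of B.run() inlined here: nothing to do
            } else {
                // Receiver the guards do not cover: stand-in for the slow path.
                // The target body is unknown here, so the call has to happen.
                task.run();
                misses++;
            }
        }
        return misses;
    }

    public static void main(String... args) {
        System.out.println(guardedLoop(new A(), 1_000)); // prints 0
        System.out.println(guardedLoop(new B(), 1_000)); // prints 0
        System.out.println(guardedLoop(new C(), 1_000)); // prints 1000
    }
}
```

With only the first two branches reachable, the loop body is visibly empty and the whole loop can go away. Once the fallback branch is taken on every iteration, the loop has to stay.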

You can add two more benchmarks to the mix to see it:

    @Benchmark
    public static void doFour_Same() {
        Runnable l = () -> {};
        execute(l);
        execute(l);
        execute(l);
        execute(l);
    }

    @Benchmark
    public static void doFour_Pair() {
        Runnable l1 = () -> {};
        Runnable l2 = () -> {};
        execute(l1);
        execute(l1);
        execute(l2);
        execute(l2);
    }

Which would then yield:

Benchmark            Mode  Cnt  Score   Error  Units
Lambdas.doFour         ss       0.445           s/op ; polymorphic
Lambdas.doFour_Pair    ss       0.016           s/op ; bimorphic
Lambdas.doFour_Same    ss       0.008           s/op ; monomorphic
Lambdas.doOne          ss       0.006           s/op

This also explains why your "fixes" help:

"using 1-2 nested classes instead of lambdas"

At most two receiver types at the call site, so bimorphic inlining still applies.

"using 1-2 lambda instances instead of 4 different ones"

At most two receiver types at the call site, so bimorphic inlining still applies.

"not calling task.run() lambdas inside the loop"

Avoids the polymorphic (opaque) call in the loop, which allows loop elimination.

"inlining the execute() method, still maintaining 4 different lambdas"

Avoids a single call site that experiences multiple call targets. In other words, it turns a single polymorphic call site into a series of monomorphic call sites, each with its own target.
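For that last case, a minimal sketch of what "inlining execute() by hand" looks like. PerSiteLoops and timeEach are made-up names, and System.nanoTime() is used here in place of the Stopwatch from the question; no particular timings are implied.

```java
// Sketch of the "inline execute() by hand" variant; PerSiteLoops and timeEach
// are hypothetical names, and System.nanoTime() replaces the original Stopwatch.
final class PerSiteLoops {

    // Runs four loops, each with its own run() call site, and returns the
    // elapsed nanoseconds of each loop.
    static long[] timeEach(int max) {
        Runnable a = () -> {};
        Runnable b = () -> {};
        Runnable c = () -> {};
        Runnable d = () -> {};
        long[] elapsed = new long[4];

        long start = System.nanoTime();
        for (int i = 0; i < max; i++) a.run();   // call site 1: only ever sees a's class
        elapsed[0] = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < max; i++) b.run();   // call site 2: only ever sees b's class
        elapsed[1] = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < max; i++) c.run();   // call site 3: only ever sees c's class
        elapsed[2] = System.nanoTime() - start;

        start = System.nanoTime();
        for (int i = 0; i < max; i++) d.run();   // call site 4: only ever sees d's class
        elapsed[3] = System.nanoTime() - start;

        return elapsed;
    }

    public static void main(String... args) {
        for (long nanos : timeEach(100_000_000)) {
            System.out.printf("%.3f ms%n", nanos / 1_000_000.0);
        }
    }
}
```

Each run() call now has its own type profile with exactly one receiver class, which is the situation the shared execute() method could not provide.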
