When are streams preferable to traditional loops for best performance? Do streams take advantage of branch prediction?

Question

I just read about branch prediction and wanted to see how it works with Java 8 Streams.

However, the performance with streams always turns out to be worse than with traditional loops.

int totalSize = 32768;
int filterValue = 1280;
int[] array = new int[totalSize];
Random rnd = new Random(0);
int loopCount = 10000;

for (int i = 0; i < totalSize; i++) {
    // array[i] = rnd.nextInt() % 2560; // Unsorted Data
    array[i] = i; // Sorted Data
}

long start = System.nanoTime();
long sum = 0;
for (int j = 0; j < loopCount; j++) {
    for (int c = 0; c < totalSize; ++c) {
        sum += array[c] >= filterValue ? array[c] : 0;
    }
}
long total = System.nanoTime() - start;
System.out.printf("Conditional Operator Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    for (int c = 0; c < totalSize; ++c) {
        if (array[c] >= filterValue) {
            sum += array[c];
        }
    }
}
total = System.nanoTime() - start;
System.out.printf("Branch Statement Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    sum += Arrays.stream(array).filter(value -> value >= filterValue).sum();
}
total = System.nanoTime() - start;
System.out.printf("Streams Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    sum += Arrays.stream(array).parallel().filter(value -> value >= filterValue).sum();
}
total = System.nanoTime() - start;
System.out.printf("Parallel Streams Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

Output:

For the sorted array:

Conditional Operator Time : 294062652 ns, (0.294063 sec)
Branch Statement Time : 272992442 ns, (0.272992 sec)
Streams Time : 806579913 ns, (0.806580 sec)
Parallel Streams Time : 2316150852 ns, (2.316151 sec)

For the unsorted array:

Conditional Operator Time : 367304250 ns, (0.367304 sec)
Branch Statement Time : 906073542 ns, (0.906074 sec)
Streams Time : 1268648265 ns, (1.268648 sec)
Parallel Streams Time : 2420482313 ns, (2.420482 sec)

I tried the same code using a List:
list.stream() instead of Arrays.stream(array)
list.get(c) instead of array[c]
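For readers who want to reproduce the List variant, it might look roughly like the sketch below. Only list.stream() and list.get(c) come from the question; the class name, the method names, and the mapToLong step are assumptions made for illustration. Note that a List<Integer> stores boxed values, so this variant also pays for unboxing on every access.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the List variant; all names are illustrative.
class ListVariant {

    // Builds the sorted list 0..size-1, boxed as Integer.
    static List<Integer> sortedList(int size) {
        List<Integer> list = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            list.add(i);
        }
        return list;
    }

    // Loop version: list.get(c) instead of array[c].
    static long loopSum(List<Integer> list, int filterValue) {
        long sum = 0;
        for (int c = 0; c < list.size(); ++c) {
            if (list.get(c) >= filterValue) {
                sum += list.get(c);
            }
        }
        return sum;
    }

    // Stream version: list.stream() instead of Arrays.stream(array).
    // The mapToLong step is an assumption; the question does not show it.
    static long streamSum(List<Integer> list, int filterValue) {
        return list.stream()
                   .filter(value -> value >= filterValue)
                   .mapToLong(Integer::longValue)
                   .sum();
    }

    public static void main(String[] args) {
        List<Integer> list = sortedList(32_768);
        System.out.println(loopSum(list, 1_280));
        System.out.println(streamSum(list, 1_280));
    }
}
```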

Output:

For the sorted list:

Conditional Operator Time : 860514446 ns, (0.860514 sec)
Branch Statement Time : 663458668 ns, (0.663459 sec)
Streams Time : 2085657481 ns, (2.085657 sec)
Parallel Streams Time : 5026680680 ns, (5.026681 sec)

For the unsorted list:

Conditional Operator Time : 704120976 ns, (0.704121 sec)
Branch Statement Time : 1327838248 ns, (1.327838 sec)
Streams Time : 1857880764 ns, (1.857881 sec)
Parallel Streams Time : 2504468688 ns, (2.504469 sec)

I referred to a few blogs (this & this) which report the same performance issue with streams.

I agree that programming with streams is nice and easier in some scenarios, but when we're losing out on performance, why do we need to use them? Is there something I'm missing?
In which scenarios do streams perform on par with loops? Is it only when the function you pass takes so long that the loop overhead becomes negligible?
In none of the scenarios could I see streams taking advantage of branch prediction. (I tried with sorted and unordered streams, but it was of no use; it more than doubled the performance impact compared to normal streams.)

Answer 1

I agree to the point that programming with streams is nice and easier for some scenarios but when we're losing out on performance, why do we need to use them?

Performance is rarely an issue. It would be usual for about 10% of your streams to need rewriting as loops to get the performance you need.

Is there something I'm missing out on?

Parallelism via parallelStream() is much easier with streams, and possibly more efficient, since it's hard to write efficient concurrent code by hand.
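To make the comparison concrete, here is a minimal sketch that puts a parallel stream next to a hand-written two-thread split of the same filtered sum. The class and method names are hypothetical, and the two-thread version is deliberately naive; it says nothing about which one is faster on a given machine.

```java
import java.util.stream.IntStream;

// Illustrative sketch; class and method names are hypothetical.
class ParallelSketch {

    // Stream version: one extra call, parallel(), spreads the work over
    // the common fork/join pool.
    static long parallelStreamSum(int[] array, int filterValue) {
        return IntStream.of(array)
                        .parallel()
                        .filter(value -> value >= filterValue)
                        .asLongStream() // widen so the sum cannot overflow int
                        .sum();
    }

    // Hand-written version with two threads, shown only for comparison.
    static long twoThreadSum(int[] array, int filterValue) {
        long[] partial = new long[2];
        int mid = array.length / 2;
        Thread lower = new Thread(() -> partial[0] = rangeSum(array, 0, mid, filterValue));
        Thread upper = new Thread(() -> partial[1] = rangeSum(array, mid, array.length, filterValue));
        lower.start();
        upper.start();
        try {
            lower.join(); // join() also makes each thread's result visible here
            upper.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while summing", e);
        }
        return partial[0] + partial[1];
    }

    // Sums the elements in [from, to) that are >= filterValue.
    static long rangeSum(int[] array, int from, int to, int filterValue) {
        long sum = 0;
        for (int c = from; c < to; ++c) {
            if (array[c] >= filterValue) {
                sum += array[c];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] array = IntStream.range(0, 32_768).toArray();
        System.out.println(parallelStreamSum(array, 1_280));
        System.out.println(twoThreadSum(array, 1_280));
    }
}
```

Even this small manual version needs explicit splitting, thread lifecycle handling, and interruption handling, all of which the stream hides behind parallel().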

Which is the scenario in which streams perform equal to loops? Is it only in the case where your function defined takes a lot of time, resulting in a negligible loop performance?

Your benchmark is flawed in the sense that the code hasn't been JIT-compiled when it starts. I would run the whole test in a loop, as JMH does, or I would simply use JMH.
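If JMH is not at hand, "running the whole test in a loop" could be sketched roughly as follows: each variant is timed for several rounds inside the same JVM, so that later rounds are more likely to execute compiled code than the first one. The class name, the round count of 20, and the helper timeRounds are assumptions for illustration; this is not a substitute for JMH, which also handles forking and dead-code elimination.

```java
import java.util.function.LongSupplier;
import java.util.stream.IntStream;

// Rough sketch of in-JVM warm-up; not a replacement for JMH.
class WarmupSketch {

    static final int[] ARRAY = IntStream.range(0, 32_768).toArray();
    static final int FILTER_VALUE = 1_280;

    static long loopSum() {
        long sum = 0;
        for (int c = 0; c < ARRAY.length; ++c) {
            if (ARRAY[c] >= FILTER_VALUE) {
                sum += ARRAY[c];
            }
        }
        return sum;
    }

    static long streamSum() {
        return IntStream.of(ARRAY)
                        .filter(value -> value >= FILTER_VALUE)
                        .asLongStream()
                        .sum();
    }

    // Times the same task for several rounds and returns the per-round
    // durations in nanoseconds. Early rounds include interpretation and
    // JIT compilation overhead.
    static long[] timeRounds(String label, LongSupplier task, int rounds) {
        long[] nanos = new long[rounds];
        long checksum = 0;
        for (int r = 0; r < rounds; r++) {
            long start = System.nanoTime();
            checksum += task.getAsLong(); // use the result so the work is not dead code
            nanos[r] = System.nanoTime() - start;
        }
        System.out.printf("%s: first round %d ns, last round %d ns (checksum %d)%n",
                label, nanos[0], nanos[rounds - 1], checksum);
        return nanos;
    }

    public static void main(String[] args) {
        timeRounds("loop", WarmupSketch::loopSum, 20);
        timeRounds("stream", WarmupSketch::streamSum, 20);
    }
}
```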

In none of the scenarios could I see streams taking advantage of branch prediction

Branch prediction is a CPU feature, not a JVM or streams feature.
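Since the branch lives in the machine code rather than in the stream API, the same filter can also be expressed with no conditional jump at all. The sketch below uses the well-known sign-mask trick; it is not taken from the question, the names are hypothetical, and it assumes value - filterValue never overflows an int. Whether the JIT already turns the conditional-operator form into something similar is not verified here.

```java
import java.util.Random;

// Illustrative sketch; names are hypothetical and the mask trick
// assumes (value - filterValue) does not overflow int.
class BranchFreeSketch {

    // Version with a conditional branch, as in the question.
    static long branchSum(int[] array, int filterValue) {
        long sum = 0;
        for (int value : array) {
            if (value >= filterValue) {
                sum += value;
            }
        }
        return sum;
    }

    // Version without a conditional branch in the source.
    static long maskSum(int[] array, int filterValue) {
        long sum = 0;
        for (int value : array) {
            // mask is -1 (all bits set) when value < filterValue, otherwise 0
            int mask = (value - filterValue) >> 31;
            // ~mask keeps the value when it passes the filter, else yields 0
            sum += ~mask & value;
        }
        return sum;
    }

    public static void main(String[] args) {
        Random rnd = new Random(0);
        int[] array = new int[32_768];
        for (int i = 0; i < array.length; i++) {
            array[i] = rnd.nextInt() % 2560; // unsorted data, as in the question
        }
        System.out.println(branchSum(array, 1_280) == maskSum(array, 1_280));
    }
}
```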

Answer 2

Java is a high-level language that saves the programmer from having to consider low-level performance optimization.

Never choose a certain approach for performance reasons unless you have proven that this is a problem in your real application.

Your measurements show some negative effect for streams, but the difference is below observability. Therefore, it's not a problem. Also, this test is a "synthetic" situation, and the code may behave completely differently in a heavy-duty production environment. Furthermore, the machine code the JIT creates from your Java (byte) code may change in future Java (maintenance) releases and make your measurements obsolete.

In conclusion: choose the syntax or approach that best expresses your (the programmer's) intention. Keep to that same approach or syntax throughout the program unless you have a good reason to change.

Answer 3

Everything has been said, but I want to show you what your code should look like using JMH.

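import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

import org.openjdk.jmh.annotations.*;

@Fork(3)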
@BenchmarkMode(Mode.AverageTime)
@Measurement(iterations = 10, timeUnit = TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {

  private final int totalSize = 32_768;
  private final int filterValue = 1_280;
  private final int loopCount = 10_000;
  // private Random rnd;

  private int[] array;

  @Setup
  public void setup() {
    array = IntStream.range(0, totalSize).toArray();

    // rnd = new Random(0);
    // array = rnd.ints(totalSize).map(i -> i % 2560).toArray();
  }

  @Benchmark
  public long conditionalOperatorTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      for (int c = 0; c < totalSize; ++c) {
        sum += array[c] >= filterValue ? array[c] : 0;
      }
    }
    return sum;
  }

  @Benchmark
  public long branchStatementTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      for (int c = 0; c < totalSize; ++c) {
        if (array[c] >= filterValue) {
          sum += array[c];
        }
      }
    }
    return sum;
  }

  @Benchmark
  public long streamsTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      sum += IntStream.of(array).filter(value -> value >= filterValue).sum();
    }
    return sum;
  }

  @Benchmark
  public long parallelStreamsTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      sum += IntStream.of(array).parallel().filter(value -> value >= filterValue).sum();
    }
    return sum;
  }
}

The results for a sorted array:

Benchmark                            Mode  Cnt           Score           Error  Units
MyBenchmark.branchStatementTime      avgt   30   119833793,881 ±   1345228,723  ns/op
MyBenchmark.conditionalOperatorTime  avgt   30   118146194,368 ±   1748693,962  ns/op
MyBenchmark.parallelStreamsTime      avgt   30   499436897,422 ±   7344346,333  ns/op
MyBenchmark.streamsTime              avgt   30  1126768177,407 ± 198712604,716  ns/op

Results for unsorted data:

Benchmark                            Mode  Cnt           Score           Error  Units
MyBenchmark.branchStatementTime      avgt   30   534932594,083 ±   3622551,550  ns/op
MyBenchmark.conditionalOperatorTime  avgt   30   530641033,317 ±   8849037,036  ns/op
MyBenchmark.parallelStreamsTime      avgt   30   489184423,406 ±   5716369,132  ns/op
MyBenchmark.streamsTime              avgt   30  1232020250,900 ± 185772971,366  ns/op

I can only say that the JVM has many possible optimizations, and maybe branch prediction is also involved. Now it is up to you to interpret the benchmark results.

Answer 4

I will add my $0.02 here.

I just read about Branch-Prediction and wanted to try how this works with Java 8 Streams

Branch prediction is a CPU feature; it has nothing to do with the JVM. It is needed to keep the CPU pipeline full and ready to do something. Measuring or predicting branch prediction is extremely hard (unless you know the EXACT things the CPU will do). It depends at least on the load the CPU is under right now (which might be a lot more than just your program).

However the performance with Streams is always turning out to be worse than traditional loops

This statement and the previous one are unrelated. Yes, streams will be slower for simple examples like yours, up to 30% slower, which is OK. You could measure how much slower or faster they are for a particular case via JMH, as others have suggested, but that proves only that case, only that load.

At the same time, you might be working with Spring/Hibernate/services, etc. that do things in milliseconds while your streams take nanoseconds, and you worry about performance? You are questioning the speed of the fastest part of your code? That's of course a theoretical thing.

As for your last point, that you tried with sorted and unsorted arrays and got bad results: this is absolutely no indication of whether branch prediction happened. You have no idea at which point the prediction happened, or whether it did, unless you can look inside the actual CPU pipelines, which you did not.

Answer 5

How can my Java program run fast?

Long story short, Java programs can be accelerated by:

Multithreading
JIT

Do streams relate to Java program speedup?

Yes!

Note the Collection.parallelStream() and Stream.parallel() methods for multithreading.
One can write a for loop whose body is long enough for the JIT to skip it. Lambdas are typically small and can be compiled by the JIT, so there is a possibility to gain performance.

In what scenario can a stream be faster than a `for` loop?

Let's take a look at jdk/src/share/vm/runtime/globals.hpp

develop(intx, HugeMethodLimit,  8000,
        "Don't compile methods larger than this if "
        "+DontCompileHugeMethods")

If a loop is long enough, its method won't be compiled by the JIT and will run slowly. If you rewrite such a loop as a stream, you'll probably use the map, filter, and flatMap methods, which split the code into pieces, and every piece can be small enough to fit under the limit. Of course, writing huge methods has other downsides apart from JIT compilation. This scenario is worth considering if, for example, you have a lot of generated code.
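As a toy illustration of that rewrite, the sketch below shows the same computation once as a single loop body and once as a pipeline whose stages are separate small lambdas, each of which javac compiles into its own small method. The arithmetic and all names are invented for the example; a real candidate would be a method thousands of bytecodes long, which this is not.

```java
import java.util.stream.IntStream;

// Toy sketch; the arithmetic and the names are invented for illustration.
class SplitPipelineSketch {

    // Everything lives in one method body. In the scenario described
    // above, this body would be large enough to exceed HugeMethodLimit.
    static long loopVersion(int[] data) {
        long sum = 0;
        for (int value : data) {
            int scaled = value * 3;
            if (scaled % 2 == 0) {
                sum += scaled + 1L;
            }
        }
        return sum;
    }

    // The same steps split into stages; each lambda becomes its own
    // small method that the JIT can compile independently.
    static long streamVersion(int[] data) {
        return IntStream.of(data)
                        .map(value -> value * 3)
                        .filter(scaled -> scaled % 2 == 0)
                        .mapToLong(scaled -> scaled + 1L)
                        .sum();
    }

    public static void main(String[] args) {
        int[] data = IntStream.rangeClosed(1, 10).toArray();
        System.out.println(loopVersion(data));
        System.out.println(streamVersion(data));
    }
}
```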

What about branch prediction?

Of course streams take advantage of branch prediction, as all other code does. However, AFAIK branch prediction isn't a technology explicitly used to make streams faster.

So, when do I rewrite my loops to streams to achieve the best performance?

Never.

"Premature optimization is the root of all evil" (Donald Knuth)

Try to optimize the algorithm instead. Streams are an interface for functional-style programming, not a tool to speed up loops.
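As one example of what an algorithmic change could look like for the sorted case in the question: once the data is sorted, there is no need to test every element, because a binary search finds the first element that passes the filter and only the tail has to be summed. This is a sketch under that sortedness assumption, with hypothetical names; it does not apply to the unsorted case.

```java
import java.util.stream.IntStream;

// Sketch for sorted input only; names are hypothetical.
class LowerBoundSketch {

    // Returns the index of the first element >= key in an ascending array,
    // or sorted.length if there is none.
    static int lowerBound(int[] sorted, int key) {
        int lo = 0;
        int hi = sorted.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (sorted[mid] < key) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    // Sums all elements >= filterValue without comparing each one.
    static long tailSum(int[] sorted, int filterValue) {
        long sum = 0;
        for (int c = lowerBound(sorted, filterValue); c < sorted.length; ++c) {
            sum += sorted[c];
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] array = IntStream.range(0, 32_768).toArray();
        System.out.println(tailSum(array, 1_280));
    }
}
```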
