When should streams be preferred over traditional loops for best performance? Do streams take advantage of branch-prediction?

Question

I just read about Branch-Prediction and wanted to try how this works with Java 8 Streams.

However the performance with Streams is always turning out to be worse than traditional loops.

int totalSize = 32768;
int filterValue = 1280;
int[] array = new int[totalSize];
Random rnd = new Random(0);
int loopCount = 10000;

for (int i = 0; i < totalSize; i++) {
    // array[i] = rnd.nextInt() % 2560; // Unsorted Data
    array[i] = i; // Sorted Data
}

long start = System.nanoTime();
long sum = 0;
for (int j = 0; j < loopCount; j++) {
    for (int c = 0; c < totalSize; ++c) {
        sum += array[c] >= filterValue ? array[c] : 0;
    }
}
long total = System.nanoTime() - start;
System.out.printf("Conditional Operator Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    for (int c = 0; c < totalSize; ++c) {
        if (array[c] >= filterValue) {
            sum += array[c];
        }
    }
}
total = System.nanoTime() - start;
System.out.printf("Branch Statement Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    sum += Arrays.stream(array).filter(value -> value >= filterValue).sum();
}
total = System.nanoTime() - start;
System.out.printf("Streams Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

start = System.nanoTime();
sum = 0;
for (int j = 0; j < loopCount; j++) {
    sum += Arrays.stream(array).parallel().filter(value -> value >= filterValue).sum();
}
total = System.nanoTime() - start;
System.out.printf("Parallel Streams Time : %d ns, (%f sec) %n", total, total / Math.pow(10, 9));

Output :

For Sorted-Array :

 Conditional Operator Time : 294062652 ns, (0.294063 sec) Branch Statement Time : 272992442 ns, (0.272992 sec) Streams Time : 806579913 ns, (0.806580 sec) Parallel Streams Time : 2316150852 ns, (2.316151 sec)

For Un-Sorted Array:

 Conditional Operator Time : 367304250 ns, (0.367304 sec) Branch Statement Time : 906073542 ns, (0.906074 sec) Streams Time : 1268648265 ns, (1.268648 sec) Parallel Streams Time : 2420482313 ns, (2.420482 sec)

I tried the same code using List :
list.stream() instead of Arrays.stream(array)
list.get(c) instead of array[c]

Output :

For Sorted-List :

 Conditional Operator Time : 860514446 ns, (0.860514 sec) Branch Statement Time : 663458668 ns, (0.663459 sec) Streams Time : 2085657481 ns, (2.085657 sec) Parallel Streams Time : 5026680680 ns, (5.026681 sec)

For Un-Sorted List

 Conditional Operator Time : 704120976 ns, (0.704121 sec) Branch Statement Time : 1327838248 ns, (1.327838 sec) Streams Time : 1857880764 ns, (1.857881 sec) Parallel Streams Time : 2504468688 ns, (2.504469 sec)

I referred to few blogs this & this which suggest the same performance issue wrt streams.

I agree to the point that programming with streams is nice and easier for some scenarios but when we're losing out on performance, why do we need to use them? Is there something I'm missing out on?
Which is the scenario in which streams perform equal to loops? Is it only in the case where your function defined takes a lot of time, resulting in a negligible loop performance?
In none of the scenario's I could see streams taking advantage of branch-prediction (I tried with sorted and unordered streams, but of no use. It gave more than double the performance impact compared to normal streams)?

Answer 1

I agree to the point that programming with streams is nice and easier for some scenarios but when we're losing out on performance, why do we need to use them?

Performance is rarely an issue. It would be usual for 10% of your streams would need to be rewritten as loops to get the performance you need.

Is there something I'm missing out on?

Using parallelStream() is much easier using streams and possibly more efficient as it's hard to write efficient concurrent code.

Which is the scenario in which streams perform equal to loops? Is it only in the case where your function defined takes a lot of time, resulting in a negligible loop performance?

Your benchmark is flawed in the sense that the code hasn't been compiled when it starts. I would do the whole test in a loop as JMH does, or I would use JMH.

In none of the scenario's I could see streams taking advantage of branch-prediction

Branch prediction is a CPU feature not a JVM or streams feature.

Answer 2

Java is a high level language saving the programmer from considering low level performance optimization.

Never choose a certain approach for performance reasons unless you have proven that this is a problem in your real application.

Your measurements show some negative effect for streams, but the difference is below observability. Therefore, it's not a Problem. Also, this Test is a "synthetic" situation and the code may behave completely different in a heavy duty production environment. Furthermore, the machine code created from your Java (byte) code by the JIT may change in future Java (maintenance) releases and make your measurements obsolete.

In conclusion: Choose the syntax or approach that most expresses your (the programmer's) intention . Keep to that same approach or syntax throughout the program unless you have a good reason to change.

Answer 3

Everything is said, but I want to show you how your code should look like using JMH .

@Fork(3)
@BenchmarkMode(Mode.AverageTime)
@Measurement(iterations = 10, timeUnit = TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
@Threads(1)
@Warmup(iterations = 5, timeUnit = TimeUnit.NANOSECONDS)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
public class MyBenchmark {

  private final int totalSize = 32_768;
  private final int filterValue = 1_280;
  private final int loopCount = 10_000;
  // private Random rnd;

  private int[] array;

  @Setup
  public void setup() {
    array = IntStream.range(0, totalSize).toArray();

    // rnd = new Random(0);
    // array = rnd.ints(totalSize).map(i -> i % 2560).toArray();
  }

  @Benchmark
  public long conditionalOperatorTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      for (int c = 0; c < totalSize; ++c) {
        sum += array[c] >= filterValue ? array[c] : 0;
      }
    }
    return sum;
  }

  @Benchmark
  public long branchStatementTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      for (int c = 0; c < totalSize; ++c) {
        if (array[c] >= filterValue) {
          sum += array[c];
        }
      }
    }
    return sum;
  }

  @Benchmark
  public long streamsTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      sum += IntStream.of(array).filter(value -> value >= filterValue).sum();
    }
    return sum;
  }

  @Benchmark
  public long parallelStreamsTime() {
    long sum = 0;
    for (int j = 0; j < loopCount; j++) {
      sum += IntStream.of(array).parallel().filter(value -> value >= filterValue).sum();
    }
    return sum;
  }
}

The results for a sorted array:

Benchmark                            Mode  Cnt           Score           Error  Units
MyBenchmark.branchStatementTime      avgt   30   119833793,881 ±   1345228,723  ns/op
MyBenchmark.conditionalOperatorTime  avgt   30   118146194,368 ±   1748693,962  ns/op
MyBenchmark.parallelStreamsTime      avgt   30   499436897,422 ±   7344346,333  ns/op
MyBenchmark.streamsTime              avgt   30  1126768177,407 ± 198712604,716  ns/op

Results for unsorted data:

Benchmark                            Mode  Cnt           Score           Error  Units
MyBenchmark.branchStatementTime      avgt   30   534932594,083 ±   3622551,550  ns/op
MyBenchmark.conditionalOperatorTime  avgt   30   530641033,317 ±   8849037,036  ns/op
MyBenchmark.parallelStreamsTime      avgt   30   489184423,406 ±   5716369,132  ns/op
MyBenchmark.streamsTime              avgt   30  1232020250,900 ± 185772971,366  ns/op

I only can say that there are many possibilities of JVM optimizations and maybe branch-prediction is also involved. Now it is up to you to interpret the benchmark results.

Answer 4

I will add my 0.02$ here.

I just read about Branch-Prediction and wanted to try how this works with Java 8 Streams

Branch Prediction is a CPU feature, it has nothing to do with JVM. It is needed to keep CPU pipeline full and ready to do something. Measuring or predicting the branch prediction is extremely hard (unless you actually know the EXACT things that the CPU will do). This will depend on at least the load that the CPU is having right now (that might be a lot more than your program only).

However the performance with Streams is always turning out to be worse than traditional loops

This statement and the previous one are un-related. Yes, streams will be slower for simple examples like yours, up to 30% slower, which is OK. You could measure for a particular case how slower they are or faster via JMH as others have suggested, but that proves only that case, only that load.

At the same time you might be working with Spring/Hibernate/Services, etc etc that do things in milliseconds and your streams in nano-seconds and you worry about the performance? You are questioning the speed of your fastest part of the code? That's of course a theoretical thing.

And about your last point that you tried with sorted and un-sorted arrays and it gives you bad results. This is absolutely no indication of branch prediction or not - you have no idea at which point the prediction happened and if it did unless you can look inside the actual CPU pipelines - which you did not.

Answer 5

How can my Java program run fast?

Long story short, Java programs can be accelerated by:

Multithreading
JIT

Do streams relate to Java program speedup?

Yes!

Note Collection.parallelStream() and Stream.parallel() methods for multithreading
One can write for cycle that is long enough for JIT to skip. Lambdas are typically small and can be compiled by JIT => there's possibility to gain performance

What is the scenario stream can be faster than `for` loop?

Let's take a look at jdk/src/share/vm/runtime/globals.hpp

develop(intx, HugeMethodLimit,  8000,
        "Don't compile methods larger than this if "
        "+DontCompileHugeMethods")

If you have long enough cycle, it won't be compiled by JIT and will run slowly. If you rewrite such a cycle to stream you'll probably use map , filter , flatMap methods that split code to pieces and every piece can be small enough to fit under limit. For sure, writing huge methods has other downsides apart from JIT compilation. This scenario can be considered if, for example, you've got a lot of generated code.

What's about branch prediction?

Of course streams take advantage of branch prediction as every other code does. However branch prediction isn't the technology explicitly used to make streams faster AFAIK.

So, when do I rewrite my loops to streams to achieve the best performance?

Never.

Premature optimization is the root of all evil © Donald Knuth

Try to optimize algorithm instead. Streams are the interface for functional-like programming, not a tool to speedup loops.

When should streams be preferred over traditional loops for best performance? Do streams take advantage of branch-prediction?

Question

5 answers

solution1
42 ACCPTED 2016-12-22 08:32:34

solution2
27 2016-12-22 08:44:30

solution3
16 2016-12-22 12:40:43

solution4
10 2016-12-22 22:03:48

solution5
4 2016-12-27 23:13:18

How can my Java program run fast?

Do streams relate to Java program speedup?

What is the scenario stream can be faster than `for` loop?

What's about branch prediction?

So, when do I rewrite my loops to streams to achieve the best performance?

When should streams be preferred over traditional loops for best performance? Do streams take advantage of branch-prediction?

Question

5 answers

solution1 42 ACCPTED 2016-12-22 08:32:34

solution2 27 2016-12-22 08:44:30

solution3 16 2016-12-22 12:40:43

solution4 10 2016-12-22 22:03:48

solution5 4 2016-12-27 23:13:18

How can my Java program run fast?

Do streams relate to Java program speedup?

What is the scenario stream can be faster than for loop?

What's about branch prediction?

So, when do I rewrite my loops to streams to achieve the best performance?

solution1
42 ACCPTED 2016-12-22 08:32:34

solution2
27 2016-12-22 08:44:30

solution3
16 2016-12-22 12:40:43

solution4
10 2016-12-22 22:03:48

solution5
4 2016-12-27 23:13:18

What is the scenario stream can be faster than `for` loop?