
Parallel stream processing vs thread pool processing vs sequential processing

I was just evaluating which of the following code snippets performs better in Java 8.

Snippet 1 (Processing in the main thread):

public long doSequence() {
    DoubleStream ds = IntStream.range(0, 100000).asDoubleStream();
    long startTime = System.currentTimeMillis();
    final AtomicLong al = new AtomicLong();
    ds.forEach((num) -> {
        long n1 = (long) Math.pow(num, 3);
        long n2 = (long) Math.pow(num, 2);
        al.addAndGet(n1 + n2);
    });
    System.out.println("Sequence");
    System.out.println(al.get());
    long endTime = System.currentTimeMillis();
    return (endTime - startTime);
}

Snippet 2 (Processing in parallel threads):

public long doParallel() {
    long startTime = System.currentTimeMillis();
    final AtomicLong al = new AtomicLong();
    DoubleStream ds = IntStream.range(0, 100000).asDoubleStream();
    ds.parallel().forEach((num) -> {
        long n1 = (long) Math.pow(num, 3);
        long n2 = (long) Math.pow(num, 2);
        al.addAndGet(n1 + n2);
    });
    System.out.println("Parallel");
    System.out.println(al.get());
    long endTime = System.currentTimeMillis();
    return (endTime - startTime);
}

Snippet 3 (Processing in parallel threads from a thread pool):

public long doThreadPoolParallel() throws InterruptedException, ExecutionException {
    ForkJoinPool customThreadPool = new ForkJoinPool(4);
    DoubleStream ds = IntStream.range(0, 100000).asDoubleStream();
    long startTime = System.currentTimeMillis();
    final AtomicLong al = new AtomicLong();
    customThreadPool.submit(() -> ds.parallel().forEach((num) -> {
        long n1 = (long) Math.pow(num, 3);
        long n2 = (long) Math.pow(num, 2);
        al.addAndGet(n1 + n2);
    })).get();
    System.out.println("Thread Pool");
    System.out.println(al.get());
    long endTime = System.currentTimeMillis();
    return (endTime - startTime);
}

The output is:

Parallel
6553089257123798384
34 <-- 34 milliseconds

Thread Pool
6553089257123798384
23 <-- 23 milliseconds

Sequence
6553089257123798384
12 <-- 12 milliseconds!

What I expected was:

1) The time for processing with the thread pool should be the minimum, but it's not. (Note that I have not included the thread-pool creation time, so it should be fast.)

2) I never expected the sequential code to be the fastest; what could be the reason for that?

I am using a quad-core processor.

I'd appreciate any help explaining the above discrepancy!

Your comparison isn't perfect, most likely because of the lack of JVM warm-up. When I simply repeat the executions, I get different results:

System.out.println(doParallel());
System.out.println(doThreadPoolParallel());
System.out.println(doSequence());
System.out.println("-------");
System.out.println(doParallel());
System.out.println(doThreadPoolParallel());
System.out.println(doSequence());
System.out.println("-------");
System.out.println(doParallel());
System.out.println(doThreadPoolParallel());
System.out.println(doSequence());

Results:

Parallel
6553089257123798384
65
Thread Pool
6553089257123798384
13
Sequence
6553089257123798384
14
-------
Parallel
6553089257123798384
9
Thread Pool
6553089257123798384
4
Sequence
6553089257123798384
8
-------
Parallel
6553089257123798384
8
Thread Pool
6553089257123798384
3
Sequence
6553089257123798384
8

As pointed out by @Erwin in the comments, please check the answers to this question (rule 1 applies in this case) for ideas on how to do this benchmarking correctly.
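The rule referred to here is "warm up": run the workload untimed until the JIT has compiled the hot paths, and only then start measuring. A minimal sketch of that idea in plain Java is below (the class and method names are my own; for serious measurements a harness such as JMH is the better tool):

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.IntStream;

public class WarmupBench {

    // The workload under test: the same computation as the snippets above.
    static long workload() {
        AtomicLong al = new AtomicLong();
        IntStream.range(0, 100_000).asDoubleStream().forEach(num -> {
            long n1 = (long) Math.pow(num, 3);
            long n2 = (long) Math.pow(num, 2);
            al.addAndGet(n1 + n2);
        });
        return al.get();
    }

    // Run the workload 'warmups' times untimed so the JIT can compile it,
    // then time 'runs' measured iterations and return the average in ms.
    static double measureAvgMillis(int warmups, int runs) {
        for (int i = 0; i < warmups; i++) {
            workload();
        }
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            workload();
        }
        return (System.nanoTime() - start) / 1_000_000.0 / runs;
    }

    public static void main(String[] args) {
        System.out.printf("avg = %.3f ms%n", measureAvgMillis(20, 10));
    }
}
```

Averaging over several measured runs after warm-up smooths out the kind of run-to-run variation visible in the repeated output above.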

The default parallelism of a parallel stream isn't necessarily the same as that of a fork-join pool with as many threads as the computer has cores, although the difference in results was still negligible when I switched from your custom pool to the common fork-join pool.
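You can check this yourself: parallel streams use the common fork-join pool by default, whose parallelism is typically one less than the core count (the thread that submits the work also participates), whereas `new ForkJoinPool(4)` creates four worker threads of its own. A quick check:

```java
import java.util.concurrent.ForkJoinPool;

public class ParallelismCheck {
    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        int commonPool = ForkJoinPool.commonPool().getParallelism();
        // On a quad-core machine the common pool typically reports
        // parallelism 3: three workers plus the calling thread.
        System.out.println("cores = " + cores);
        System.out.println("common pool parallelism = " + commonPool);
    }
}
```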

AtomicLong.addAndGet requires thread synchronization: every thread has to see the result of the previous addAndGet, which is why you can count on the total being correct.

Although this is not traditional synchronized-block synchronization, it still has an overhead. In JDK 7, addAndGet employed a spin loop in Java code; in JDK 8 it was turned into an intrinsic, which HotSpot implements with a LOCK XADD instruction on the Intel platform.

It requires cache synchronization between CPUs, which has an overhead. It may even require data to be flushed to and read from main memory, which is extremely slow compared to code that doesn't need to do that.

Since this synchronization overhead is incurred on every iteration of your test, it's quite possible that it outweighs any performance gain from parallelizing.
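One way to confirm this is to remove the shared AtomicLong entirely and let the stream do the reduction itself via `mapToLong(...).sum()`: each worker then accumulates its own partial sum, and the partials are combined once at the end, with no per-element contention. A sketch (method names are my own):

```java
import java.util.stream.IntStream;

public class ReductionDemo {

    // Contention-free parallel version: the stream reduction gives each
    // worker thread a local partial sum, combined once at the end.
    static long sumParallel() {
        return IntStream.range(0, 100_000).asDoubleStream().parallel()
                .mapToLong(num -> (long) Math.pow(num, 3) + (long) Math.pow(num, 2))
                .sum();
    }

    // Sequential version of the same reduction, for comparison.
    static long sumSequential() {
        return IntStream.range(0, 100_000).asDoubleStream()
                .mapToLong(num -> (long) Math.pow(num, 3) + (long) Math.pow(num, 2))
                .sum();
    }

    public static void main(String[] args) {
        // Long addition wraps on overflow the same way in any order, so
        // both variants produce the same total as the AtomicLong versions.
        System.out.println(sumParallel() == sumSequential());
    }
}
```

With the contention gone, the parallel version has a much better chance of beating the sequential one, although for a workload this small the fork-join splitting overhead can still dominate.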
