
Unexpected Scalability results in Java Fork-Join (Java 8)

Recently, I ran some scalability experiments using Java Fork-Join. I used the non-default constructor ForkJoinPool(int parallelism), passing the desired parallelism (number of workers) as the constructor argument.

Specifically, using the following piece of code:

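import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// args[0]: desired parallelism (number of worker threads)
public static void main(String[] args) throws InterruptedException {
    ForkJoinPool pool = new ForkJoinPool(Integer.parseInt(args[0]));
    pool.invoke(new ParallelLoopTask());
}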

static class ParallelLoopTask extends RecursiveAction {

    final int n = 1000;

    @Override
    protected void compute() {
        RecursiveAction[] T = new RecursiveAction[n];
        for(int p = 0; p < n; p++){
            T[p] = new DummyTask();
            T[p].fork();
        }
        for(int p = 0; p < n; p++){
            T[p].join();
        }
        /*
        //The problem does not occur when tasks are joined in the reverse order, i.e.
        for(int p = n-1; p >= 0; p--){
            T[p].join();
        }
        */
    }
}


static public class DummyTask extends RecursiveAction {
    //performs some dummy work

    final int N = 10000000;

    //avoid memory bus contention by restricting access to cache (which is distributed)
    double val = 1;

    @Override
    protected void compute() {
        for(int j = 0; j < N; j++){
            if(val < 11){
                val *= 1.1;
            }else{
                val = 1;
            }
        }
    }
}

I got these results on a processor with 4 physical and 8 logical cores (using Java 8: jre1.8.0_45):

T1: 11730

T2: 2381 (speedup: 4,93)

T4: 2463 (speedup: 4,76)

T8: 2418 (speedup: 4,85)

Using Java 7 (jre1.7.0), I get:

T1: 11938

T2: 11843 (speedup: 1,01)

T4: 5133 (speedup: 2,33)

T8: 2607 (speedup: 4,58)

(where TP is the execution time in ms at parallelism level P, and the speedup is T1/TP)
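For reference, the speedup figures can be reproduced from the raw timings. The following is only a small sketch of that arithmetic; the class and method names (SpeedupTable, speedup, efficiency) are made up for illustration, and "efficiency" (speedup divided by P) is an extra quantity not used in the question. An efficiency above 1 corresponds to a super-linear speedup.

```java
import java.util.Locale;

public class SpeedupTable {
    // Speedup at parallelism level P: T1 / TP.
    static double speedup(double t1, double tp) {
        return t1 / tp;
    }

    // Efficiency (an added, illustrative metric): speedup divided by P.
    // Values above 1.0 indicate a super-linear speedup.
    static double efficiency(double t1, double tp, int p) {
        return speedup(t1, tp) / p;
    }

    public static void main(String[] args) {
        int[] p = {2, 4, 8};
        double t1 = 11730;                  // ms, Java 8 run with 1 worker
        double[] tp = {2381, 2463, 2418};   // ms, Java 8 runs with 2, 4, 8 workers
        for (int i = 0; i < p.length; i++) {
            System.out.printf(Locale.ROOT, "P=%d speedup=%.2f efficiency=%.2f%n",
                    p[i], speedup(t1, tp[i]), efficiency(t1, tp[i], p[i]));
        }
    }
}
```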

Both results surprise me, but I can understand the latter: the join blocks one worker (the one executing the loop), because it fails to recognize that, while waiting, it could process other pending dummy tasks from its local queue. The former, however, puzzles me.
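For comparison, here is a minimal sketch of the reverse-order variant mentioned in the code comment. The ForkJoinTask documentation recommends joining tasks innermost-first (the most recently forked task is joined first). The names ReverseJoinSketch, SmallTask, and run are hypothetical, and the workload is deliberately much smaller than in DummyTask so that the sketch finishes quickly; it is not meant to reproduce the timings above.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicInteger;

public class ReverseJoinSketch {
    static final AtomicInteger completed = new AtomicInteger();

    // Hypothetical stand-in for DummyTask with a much smaller workload.
    static class SmallTask extends RecursiveAction {
        double val = 1;

        @Override
        protected void compute() {
            for (int j = 0; j < 100_000; j++) {
                val = (val < 11) ? val * 1.1 : 1;
            }
            completed.incrementAndGet();
        }
    }

    static class ParallelLoop extends RecursiveAction {
        final int n;

        ParallelLoop(int n) {
            this.n = n;
        }

        @Override
        protected void compute() {
            RecursiveAction[] tasks = new RecursiveAction[n];
            for (int p = 0; p < n; p++) {
                tasks[p] = new SmallTask();
                tasks[p].fork(); // queued on the current worker's own deque
            }
            // Join innermost-first: the most recently forked task is joined first.
            for (int p = n - 1; p >= 0; p--) {
                tasks[p].join();
            }
        }
    }

    // Runs n small tasks on a pool with the given parallelism
    // and returns how many of them completed.
    static int run(int parallelism, int n) {
        completed.set(0);
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            pool.invoke(new ParallelLoop(n));
        } finally {
            pool.shutdown();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed tasks: " + run(2, 100));
    }
}
```

An alternative with the same intent would be to pass the task array to ForkJoinTask.invokeAll(...) and let the framework decide how to fork and join.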

BTW: when counting the dummy tasks that had started but not yet completed, I found that at some point up to 24 such tasks existed in a pool with parallelism 2. How is that possible?
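The counting could be done roughly as follows. This is only a sketch of one way to instrument the tasks, not the exact code used for the observation above; InFlightCounter, CountedTask, and run are hypothetical names, and the per-task workload is reduced. One candidate explanation for a peak above the parallelism level (stated here as an assumption, not something this sketch proves) is that the pool may start additional threads while a worker is blocked in join.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicInteger;

public class InFlightCounter {
    static final AtomicInteger inFlight = new AtomicInteger();    // started, not yet completed
    static final AtomicInteger maxInFlight = new AtomicInteger(); // highest value observed

    // Hypothetical instrumented version of DummyTask (smaller workload).
    static class CountedTask extends RecursiveAction {
        double val = 1;

        @Override
        protected void compute() {
            int now = inFlight.incrementAndGet();
            maxInFlight.accumulateAndGet(now, Math::max);
            try {
                for (int j = 0; j < 200_000; j++) {
                    val = (val < 11) ? val * 1.1 : 1;
                }
            } finally {
                inFlight.decrementAndGet();
            }
        }
    }

    // Same fork-all, then join-in-forward-order pattern as ParallelLoopTask.
    static class Loop extends RecursiveAction {
        final int n;

        Loop(int n) {
            this.n = n;
        }

        @Override
        protected void compute() {
            RecursiveAction[] tasks = new RecursiveAction[n];
            for (int p = 0; p < n; p++) {
                tasks[p] = new CountedTask();
                tasks[p].fork();
            }
            for (int p = 0; p < n; p++) {
                tasks[p].join();
            }
        }
    }

    // Returns the peak number of simultaneously running CountedTask instances.
    static int run(int parallelism, int n) {
        inFlight.set(0);
        maxInFlight.set(0);
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            pool.invoke(new Loop(n));
        } finally {
            pool.shutdown();
        }
        return maxInFlight.get();
    }

    public static void main(String[] args) {
        System.out.println("peak in-flight tasks: " + run(2, 200));
    }
}
```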

EDIT:

I benchmarked the application above using JMH (jdk1.8.0_45) with options -bm avgt -f 1 (i.e. 1 fork, 20 warmup + 20 measurement iterations). The results are below.

T1: 11,664

11,664 ±(99.9%) 0,044 s/op [Average]
(min, avg, max) = (11,597, 11,664, 11,810), stdev = 0,050
CI (99.9%): [11,620, 11,708] (assumes normal distribution)

T2: 4,134 (speedup: 2,82)

4,134 ±(99.9%) 0,787 s/op [Average]
(min, avg, max) = (3,045, 4,134, 5,376), stdev = 0,906
CI (99.9%): [3,348, 4,921] (assumes normal distribution)

T4: 2,972 (speedup: 3,92)

2,972 ±(99.9%) 0,212 s/op [Average]
(min, avg, max) = (2,375, 2,972, 3,200), stdev = 0,245
CI (99.9%): [2,759, 3,184] (assumes normal distribution)

T8: 2,845 (speedup: 4,10)

2,845 ±(99.9%) 0,306 s/op [Average]
(min, avg, max) = (2,277, 2,845, 3,310), stdev = 0,352
CI (99.9%): [2,540, 3,151] (assumes normal distribution)

At first sight, these scalability results look closer to what one would expect, i.e. T1 > T2 > T4 ≈ T8. However, the following still bugs me:

  1. The difference for T2 between Java 7 and Java 8. I guess one explanation would be that in Java 8 the worker executing the parallel loop does not go idle, but instead finds other work to perform.
  2. The super-linear speedup (3x) with 2 workers. Also, note that T2 seems to increase with every iteration (see the log below; this also happens, to a smaller extent, with P=4 and P=8). The times in the first warmup iterations are similar to those mentioned above. Maybe the warmup period should be longer, but isn't it strange that the execution time increases? I would rather expect it to decrease.
  3. Finally, I still find it curious that there are many more started but not yet completed dummy tasks than worker threads.


Run progress: 0,00% complete, ETA 00:00:40
Fork: 1 of 1
Warmup Iteration   1: 2,365 s/op
Warmup Iteration   2: 2,341 s/op
Warmup Iteration   3: 2,393 s/op
Warmup Iteration   4: 2,323 s/op
Warmup Iteration   5: 2,925 s/op
Warmup Iteration   6: 3,040 s/op
Warmup Iteration   7: 2,304 s/op
Warmup Iteration   8: 2,347 s/op
Warmup Iteration   9: 2,939 s/op
Warmup Iteration  10: 3,083 s/op
Warmup Iteration  11: 3,004 s/op
Warmup Iteration  12: 2,327 s/op
Warmup Iteration  13: 3,083 s/op
Warmup Iteration  14: 3,229 s/op
Warmup Iteration  15: 3,076 s/op
Warmup Iteration  16: 2,325 s/op
Warmup Iteration  17: 2,993 s/op
Warmup Iteration  18: 3,112 s/op
Warmup Iteration  19: 3,074 s/op
Warmup Iteration  20: 2,354 s/op
Iteration   1: 3,045 s/op
Iteration   2: 3,094 s/op
Iteration   3: 3,113 s/op
Iteration   4: 3,057 s/op
Iteration   5: 3,050 s/op
Iteration   6: 3,106 s/op
Iteration   7: 3,080 s/op
Iteration   8: 3,370 s/op
Iteration   9: 4,482 s/op
Iteration  10: 4,325 s/op
Iteration  11: 5,002 s/op
Iteration  12: 4,980 s/op
Iteration  13: 5,121 s/op
Iteration  14: 4,310 s/op
Iteration  15: 5,146 s/op
Iteration  16: 5,376 s/op
Iteration  17: 4,810 s/op
Iteration  18: 4,320 s/op
Iteration  19: 5,249 s/op
Iteration  20: 4,654 s/op

Nothing in your example shows how you ran this benchmark. It looks like you just took a millisecond timestamp at the beginning and end of the run, which is not accurate. I suggest you take a look at this SO answer and re-post your timings. BTW, the JMH benchmark is going to be the standard in Java 9, so that is what you should be using.

EDIT:

You admit that the scalability results are what you expected. But you say you're still not happy with the results. Now it's time to look inside the code.

There are serious problems with this framework. I've been writing a critique about it since 2010. As I point out here, join doesn't work. The author has tried various means to get around the problem, but the problem persists.

Increase your run time to about a minute (n=100000000), or put some heavy computations in compute(). Then profile the application in VisualVM or another profiler. This will show you the stalling threads, excessive threads, etc.
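Short of a full profiler, a rough first check for excessive threads is to sample the pool's own counters while the job runs. The sketch below is only an illustration under assumptions: PoolStatsSketch, Work, Loop, and sample are hypothetical names, the workload is small, and polling getPoolSize() about once per millisecond can miss short-lived peaks. It does not replace profiling.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.locks.LockSupport;

public class PoolStatsSketch {
    // Hypothetical small stand-in for DummyTask.
    static class Work extends RecursiveAction {
        double val = 1;

        @Override
        protected void compute() {
            for (int j = 0; j < 200_000; j++) {
                val = (val < 11) ? val * 1.1 : 1;
            }
        }
    }

    // Same fork-all, then join-in-forward-order pattern as in the question.
    static class Loop extends RecursiveAction {
        final int n;

        Loop(int n) {
            this.n = n;
        }

        @Override
        protected void compute() {
            RecursiveAction[] tasks = new RecursiveAction[n];
            for (int p = 0; p < n; p++) {
                tasks[p] = new Work();
                tasks[p].fork();
            }
            for (int p = 0; p < n; p++) {
                tasks[p].join();
            }
        }
    }

    // Returns {configured parallelism, largest pool size sampled during the run}.
    static int[] sample(int parallelism, int n) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        int maxPoolSize = 0;
        try {
            ForkJoinTask<?> root = pool.submit(new Loop(n));
            while (!root.isDone()) {
                maxPoolSize = Math.max(maxPoolSize, pool.getPoolSize());
                LockSupport.parkNanos(1_000_000L); // wait roughly 1 ms between samples
            }
            root.join(); // rethrows any exception thrown by the task
            maxPoolSize = Math.max(maxPoolSize, pool.getPoolSize());
            return new int[] {pool.getParallelism(), maxPoolSize};
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] stats = sample(2, 200);
        System.out.println("parallelism=" + stats[0] + ", max sampled pool size=" + stats[1]);
    }
}
```

If the sampled pool size exceeds the configured parallelism, that would be a hint that the pool created extra threads during the run. Other counters such as getActiveThreadCount(), getRunningThreadCount(), getQueuedTaskCount(), and getStealCount() can be sampled in the same loop.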

If that doesn't help answer your questions, then you should look at the code flow using a debugger. Profiling and code analysis are the only ways you are going to get satisfactory answers to your questions.
