Recently, I was running some scalability experiments with the Java Fork/Join framework. I used the non-default constructor ForkJoinPool(int parallelism), passing the desired parallelism (number of workers) as the constructor argument.
Specifically, I used the following piece of code:
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Enclosing class added so the snippet compiles; its name is arbitrary.
public class ForkJoinScalability {

    public static void main(String[] args) throws InterruptedException {
        // args[0] is the desired parallelism (number of workers)
        ForkJoinPool pool = new ForkJoinPool(Integer.parseInt(args[0]));
        pool.invoke(new ParallelLoopTask());
    }

    static class ParallelLoopTask extends RecursiveAction {
        final int n = 1000;

        @Override
        protected void compute() {
            RecursiveAction[] T = new RecursiveAction[n];
            // fork all dummy tasks first ...
            for (int p = 0; p < n; p++) {
                T[p] = new DummyTask();
                T[p].fork();
            }
            // ... then join them in the same order in which they were forked
            for (int p = 0; p < n; p++) {
                T[p].join();
            }
            /*
            // The problem does not occur when tasks are joined in the reverse order, i.e.
            for (int p = n - 1; p >= 0; p--) {
                T[p].join();
            }
            */
        }
    }

    static public class DummyTask extends RecursiveAction {
        // performs some dummy work
        final int N = 10000000;
        // avoid memory bus contention by restricting access to cache (which is distributed)
        double val = 1;

        @Override
        protected void compute() {
            for (int j = 0; j < N; j++) {
                if (val < 11) {
                    val *= 1.1;
                } else {
                    val = 1;
                }
            }
        }
    }
}
I got these results on a processor with 4 physical and 8 logical cores (using Java 8, jre1.8.0_45):
T1: 11730
T2: 2381 (speedup: 4,93)
T4: 2463 (speedup: 4,76)
T8: 2418 (speedup: 4,85)
With Java 7 (jre1.7.0), on the other hand, I get:
T1: 11938
T2: 11843 (speedup: 1,01)
T4: 5133 (speedup: 2,33)
T8: 2607 (speedup: 4,58)
(where TP is the execution time in ms, using parallelism level P)
Both results surprise me, but I can understand the Java 7 result: the join causes one worker (the one executing the loop) to block, because it fails to recognize that, while waiting, it could process other pending dummy tasks from its local queue. The Java 8 result, however, puzzles me.
BTW: When counting the number of dummy tasks that had been started but not yet completed, I found that at some point up to 24 such tasks existed in a pool with parallelism 2...?
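For comparison, here is a minimal sketch of the same parallel loop written with ForkJoinTask.invokeAll instead of a hand-written fork loop followed by in-order joins. This is not the code measured above: the class names InvokeAllSketch, SmallTask and ParallelLoop are made up for illustration, and the task body is shortened so that the example finishes quickly. The idea is to leave the fork/join ordering to the library; as far as I know, the JDK implementation does not join in the same order as it forks, but I have not verified that for every version.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names throughout; only the java.util.concurrent calls are real API.
class InvokeAllSketch {
    // Counts finished tasks so the result can be checked.
    static final AtomicInteger completed = new AtomicInteger();

    // Shortened stand-in for DummyTask (100k iterations instead of 10M).
    static class SmallTask extends RecursiveAction {
        double val = 1;

        @Override
        protected void compute() {
            for (int j = 0; j < 100_000; j++) {
                val = (val < 11) ? val * 1.1 : 1;
            }
            completed.incrementAndGet();
        }
    }

    static class ParallelLoop extends RecursiveAction {
        final int n;

        ParallelLoop(int n) {
            this.n = n;
        }

        @Override
        protected void compute() {
            List<SmallTask> tasks = new ArrayList<>();
            for (int p = 0; p < n; p++) {
                tasks.add(new SmallTask());
            }
            // Runs all tasks and returns once every one of them is done.
            ForkJoinTask.invokeAll(tasks);
        }
    }

    // Runs n tasks on a pool with the given parallelism; returns the number completed.
    static int run(int parallelism, int n) {
        completed.set(0);
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            pool.invoke(new ParallelLoop(n));
        } finally {
            pool.shutdown();
        }
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed: " + run(2, 1000));
    }
}
```

Timing this variant next to the forward-join and reverse-join loops would show whether the join order alone explains the difference between the two JREs.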
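The count mentioned above can be obtained along the following lines. This is only a sketch of one way to do it, not the instrumentation actually used: the names InFlightSketch, CountedTask, inFlight and peak are hypothetical, and the task body is shortened.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical names; tracks how many tasks have started but not yet completed.
class InFlightSketch {
    static final AtomicInteger inFlight = new AtomicInteger();
    static final AtomicInteger peak = new AtomicInteger();

    static class CountedTask extends RecursiveAction {
        double val = 1;

        @Override
        protected void compute() {
            // Task has started: bump the counter and remember the highest value seen.
            int now = inFlight.incrementAndGet();
            peak.accumulateAndGet(now, Math::max);
            try {
                for (int j = 0; j < 100_000; j++) {
                    val = (val < 11) ? val * 1.1 : 1;
                }
            } finally {
                // Task has completed.
                inFlight.decrementAndGet();
            }
        }
    }

    static class ParallelLoop extends RecursiveAction {
        final int n;

        ParallelLoop(int n) {
            this.n = n;
        }

        @Override
        protected void compute() {
            RecursiveAction[] t = new RecursiveAction[n];
            for (int p = 0; p < n; p++) {
                t[p] = new CountedTask();
                t[p].fork();
            }
            // Same forward join order as in the question.
            for (int p = 0; p < n; p++) {
                t[p].join();
            }
        }
    }

    // Returns the largest number of simultaneously started-but-unfinished tasks.
    static int peakInFlight(int parallelism, int n) {
        inFlight.set(0);
        peak.set(0);
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        try {
            pool.invoke(new ParallelLoop(n));
        } finally {
            pool.shutdown();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        System.out.println("peak in-flight tasks: " + peakInFlight(2, 1000));
    }
}
```

A peak above the requested parallelism means that more tasks were in progress at once than the pool nominally has workers.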
EDIT:
I benchmarked the application above using JMH (jdk1.8.0_45) with the options -bm avgt -f 1 (i.e. 1 fork, 20 warmup + 20 measurement iterations). The results are below:
T1: 11,664
11,664 ±(99.9%) 0,044 s/op [Average]
(min, avg, max) = (11,597, 11,664, 11,810), stdev = 0,050
CI (99.9%): [11,620, 11,708] (assumes normal distribution)
T2: 4,134 (speedup: 2,82)
4,134 ±(99.9%) 0,787 s/op [Average]
(min, avg, max) = (3,045, 4,134, 5,376), stdev = 0,906
CI (99.9%): [3,348, 4,921] (assumes normal distribution)
T4: 2,972 (speedup: 3,92)
2,972 ±(99.9%) 0,212 s/op [Average]
(min, avg, max) = (2,375, 2,972, 3,200), stdev = 0,245
CI (99.9%): [2,759, 3,184] (assumes normal distribution)
T8: 2,845 (speedup: 4,10)
2,845 ±(99.9%) 0,306 s/op [Average]
(min, avg, max) = (2,277, 2,845, 3,310), stdev = 0,352
CI (99.9%): [2,540, 3,151] (assumes normal distribution)
At first sight, these scalability results look closer to what one would expect, i.e. T1 > T2 > T4 ~ T8. However, what still bugs me is the following:
Run progress: 0,00% complete, ETA 00:00:40
Fork: 1 of 1
Warmup Iteration 1: 2,365 s/op
Warmup Iteration 2: 2,341 s/op
Warmup Iteration 3: 2,393 s/op
Warmup Iteration 4: 2,323 s/op
Warmup Iteration 5: 2,925 s/op
Warmup Iteration 6: 3,040 s/op
Warmup Iteration 7: 2,304 s/op
Warmup Iteration 8: 2,347 s/op
Warmup Iteration 9: 2,939 s/op
Warmup Iteration 10: 3,083 s/op
Warmup Iteration 11: 3,004 s/op
Warmup Iteration 12: 2,327 s/op
Warmup Iteration 13: 3,083 s/op
Warmup Iteration 14: 3,229 s/op
Warmup Iteration 15: 3,076 s/op
Warmup Iteration 16: 2,325 s/op
Warmup Iteration 17: 2,993 s/op
Warmup Iteration 18: 3,112 s/op
Warmup Iteration 19: 3,074 s/op
Warmup Iteration 20: 2,354 s/op
Iteration 1: 3,045 s/op
Iteration 2: 3,094 s/op
Iteration 3: 3,113 s/op
Iteration 4: 3,057 s/op
Iteration 5: 3,050 s/op
Iteration 6: 3,106 s/op
Iteration 7: 3,080 s/op
Iteration 8: 3,370 s/op
Iteration 9: 4,482 s/op
Iteration 10: 4,325 s/op
Iteration 11: 5,002 s/op
Iteration 12: 4,980 s/op
Iteration 13: 5,121 s/op
Iteration 14: 4,310 s/op
Iteration 15: 5,146 s/op
Iteration 16: 5,376 s/op
Iteration 17: 4,810 s/op
Iteration 18: 4,320 s/op
Iteration 19: 5,249 s/op
Iteration 20: 4,654 s/op
Nothing in your example shows how you ran this benchmark. It looks like you just took a millisecond timestamp at the beginning and at the end of the run. This is not accurate. I suggest you take a look at this SO answer and re-post your timings. BTW, JMH is going to be the standard benchmark harness in Java 9, so that is what you should be using.
EDIT:
You admit that the scalability results are what you expected. But you say you're still not happy with the results. Now it's time to look inside the code.
There are serious problems with this framework. I've been writing a critique about it since 2010. As I point out here, join doesn't work. The author has tried various means to get around the problem, but the problem persists.
Increase your run time to about a minute (n=100000000), or put some heavy computation in compute(). Now profile the application in VisualVM or another profiler. This will show you the stalling threads, excessive threads, etc.
If that doesn't help answer your questions, then you should look at the code flow using a debugger. Profiling and code analysis are the only way you are going to get satisfactory answers to your questions.
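If a profiler is not at hand, a rough first look is possible from inside the program by sampling the pool's own monitoring methods while the job runs. The sketch below is only an illustration under assumptions: PoolStatsSketch, BusyTask and the 20 ms sampling interval are made up, and the task body is shortened. The getters it calls (getPoolSize, getActiveThreadCount, getStealCount, getQueuedTaskCount) are standard ForkJoinPool methods, but their values are approximate snapshots and no substitute for a real profile.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.ForkJoinTask;
import java.util.concurrent.RecursiveAction;

// Hypothetical names; samples ForkJoinPool statistics while a parallel loop runs.
class PoolStatsSketch {
    static class BusyTask extends RecursiveAction {
        double val = 1;

        @Override
        protected void compute() {
            for (int j = 0; j < 1_000_000; j++) {
                val = (val < 11) ? val * 1.1 : 1;
            }
        }
    }

    static class ParallelLoop extends RecursiveAction {
        final int n;

        ParallelLoop(int n) {
            this.n = n;
        }

        @Override
        protected void compute() {
            RecursiveAction[] t = new RecursiveAction[n];
            for (int p = 0; p < n; p++) {
                t[p] = new BusyTask();
                t[p].fork();
            }
            // Forward join order, as in the question.
            for (int p = 0; p < n; p++) {
                t[p].join();
            }
        }
    }

    // Returns the largest worker-thread count observed while the job was running.
    static int maxPoolSizeObserved(int parallelism, int n) {
        ForkJoinPool pool = new ForkJoinPool(parallelism);
        ForkJoinTask<Void> root = pool.submit(new ParallelLoop(n));
        int maxThreads = 0;
        try {
            do {
                Thread.sleep(20); // sampling interval, chosen arbitrarily
                maxThreads = Math.max(maxThreads, pool.getPoolSize());
                System.out.printf("poolSize=%d active=%d steals=%d queued=%d%n",
                        pool.getPoolSize(), pool.getActiveThreadCount(),
                        pool.getStealCount(), pool.getQueuedTaskCount());
            } while (!root.isDone());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        root.join();
        pool.shutdown();
        return maxThreads;
    }

    public static void main(String[] args) {
        System.out.println("max pool size: " + maxPoolSizeObserved(2, 1000));
    }
}
```

A pool size that climbs above the requested parallelism would point to extra threads being created while workers wait in join.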