Java 8 nested loops with streams and performance

Question

To practise Java 8 streams, I tried converting the following nested loop to the Java 8 stream API. It calculates the largest digit sum of a^b (a, b < 100) and takes ~0.135s on my Core i5 760.

public static int digitSum(BigInteger x)
{
    int sum = 0;
    for(char c: x.toString().toCharArray()) {sum+=Integer.valueOf(c+"");}
    return sum;
}

@Test public void solve()
{
    int max = 0;
    for(int i=1;i<100;i++)
        for(int j=1;j<100;j++)
            max = Math.max(max,digitSum(BigInteger.valueOf(i).pow(j)));
    System.out.println(max);
}

My solution, which I expected to be faster because of the parallelism, actually took 0.25s (0.19s without the parallel()):

int max =   IntStream.range(1,100).parallel()
            .map(i -> IntStream.range(1, 100)
            .map(j->digitSum(BigInteger.valueOf(i).pow(j)))
            .max().getAsInt()).max().getAsInt();

My questions

Did I do the conversion right, or is there a better way to convert nested loops to stream calculations?
Why is the stream variant so much slower than the old one?
Why did the parallel() statement actually increase the time from 0.19s to 0.25s?

I know that microbenchmarks are fragile and that parallelism is only worth it for big problems, but for a CPU, even 0.1s is an eternity, right?

Update

I measure with the JUnit 4 framework in Eclipse Kepler (it shows the time taken to execute a test).

My results for a, b < 1000 instead of 100:

traditional loop: 186s
sequential stream: 193s
parallel stream: 55s

Update 2

Replacing sum+=Integer.valueOf(c+""); with sum+= c - '0'; (thanks Peter!) shaved 10 whole seconds off the parallel method, bringing it to 45s. I didn't expect such a big performance impact!
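The replacement described in this update can be sketched as follows. This is only an illustration: the class name DigitSumSketch and the method names are assumptions, not part of the original code, and the third variant (based on String.chars()) is an extra alternative that the question does not use. All variants assume a non-negative BigInteger, since a minus sign would not map to a digit.

```java
import java.math.BigInteger;

// Hypothetical class name, used only for this sketch.
class DigitSumSketch {

    // Original approach: builds a one-character String and a boxed Integer per digit.
    static int digitSumBoxed(BigInteger x) {
        int sum = 0;
        for (char c : x.toString().toCharArray()) {
            sum += Integer.valueOf(c + "");
        }
        return sum;
    }

    // Arithmetic approach: '0'..'9' are contiguous, so c - '0' is the digit value.
    static int digitSumArithmetic(BigInteger x) {
        int sum = 0;
        for (char c : x.toString().toCharArray()) {
            sum += c - '0';
        }
        return sum;
    }

    // Stream-based alternative (an assumption, not from the question).
    static int digitSumChars(BigInteger x) {
        return x.toString().chars().map(c -> c - '0').sum();
    }

    public static void main(String[] args) {
        BigInteger small = BigInteger.valueOf(2).pow(10); // 1024
        System.out.println(digitSumArithmetic(small));    // 1 + 0 + 2 + 4 = 7

        BigInteger big = BigInteger.valueOf(99).pow(95);
        // The three variants should agree on any non-negative input.
        System.out.println(digitSumBoxed(big) == digitSumArithmetic(big));
        System.out.println(digitSumArithmetic(big) == digitSumChars(big));
    }
}
```

The arithmetic variant avoids allocating a String and an Integer for every digit, which is a plausible reason for the speed-up reported above.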

Also, reducing the parallelism to the number of CPU cores (4 in my case) didn't do much, as it only reduced the time to 44.8s (yes, it adds a and b=0, but I think this won't impact the performance much):

int max = IntStream.range(0, 4).parallel()  // 4 chunks of 250 values of a, one per core
          .map(m -> IntStream.range(0, 250)
          .map(i -> IntStream.range(1, 1000)
          .map(j -> digitSum(BigInteger.valueOf(250*m+i).pow(j)))
          .max().getAsInt()).max().getAsInt()).max().getAsInt();

Answer 1

I have created a quick and dirty microbenchmark based on your code. The results (total milliseconds for 100 calls) are:

loop: 3192
lambda: 3140
lambda parallel: 868

So the loop and the lambda are equivalent, and the parallel stream significantly improves performance. I suspect your results are unreliable because of your benchmarking methodology.

public static void main(String[] args) {
    int sum = 0;

    //warmup
    for (int i = 0; i < 100; i++) {
        solve();
        solveLambda();
        solveLambdaParallel();
    }

    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solve();
        }
        long end = System.nanoTime();
        System.out.println("loop: " + (end - start) / 1_000_000);
    }
    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solveLambda();
        }
        long end = System.nanoTime();
        System.out.println("lambda: " + (end - start) / 1_000_000);
    }
    {
        long start = System.nanoTime();
        for (int i = 0; i < 100; i++) {
            sum += solveLambdaParallel();
        }
        long end = System.nanoTime();
        System.out.println("lambda parallel : " + (end - start) / 1_000_000);
    }
    System.out.println(sum);
}

public static int digitSum(BigInteger x) {
    int sum = 0;
    for (char c : x.toString().toCharArray()) {
        sum += Integer.valueOf(c + "");
    }
    return sum;
}

public static int solve() {
    int max = 0;
    for (int i = 1; i < 100; i++) {
        for (int j = 1; j < 100; j++) {
            max = Math.max(max, digitSum(BigInteger.valueOf(i).pow(j)));
        }
    }
    return max;
}

public static int solveLambda() {
    return  IntStream.range(1, 100)
            .map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
            .max().getAsInt();
}

public static int solveLambdaParallel() {
    return  IntStream.range(1, 100)
            .parallel()
            .map(i -> IntStream.range(1, 100).map(j -> digitSum(BigInteger.valueOf(i).pow(j))).max().getAsInt())
            .max().getAsInt();
}

I have also run it with jmh, which is more reliable than manual tests. The results are consistent with the above (microseconds per call):

Benchmark                                Mode   Mean        Units
c.a.p.SO21968918.solve                   avgt   32367.592   us/op
c.a.p.SO21968918.solveLambda             avgt   31423.123   us/op
c.a.p.SO21968918.solveLambdaParallel     avgt   8125.600    us/op
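On the first question (whether there is a better way to convert nested loops), one possible alternative is to flatten both ranges into a single stream of digit sums with IntStream.flatMap and take one max() at the end. The sketch below is an assumption layered on the code above, not something that was benchmarked here: the class name FlatMapSketch, the limit parameter, and the chars()-based digitSum are all illustrative.

```java
import java.math.BigInteger;
import java.util.stream.IntStream;

// Hypothetical class name, used only for this sketch.
class FlatMapSketch {

    // Digit sum of a non-negative BigInteger.
    static int digitSum(BigInteger x) {
        return x.toString().chars().map(c -> c - '0').sum();
    }

    // Nested loops, as in the question; limit is exclusive.
    static int solveLoop(int limit) {
        int max = 0;
        for (int i = 1; i < limit; i++) {
            for (int j = 1; j < limit; j++) {
                max = Math.max(max, digitSum(BigInteger.valueOf(i).pow(j)));
            }
        }
        return max;
    }

    // One flat stream of all digit sums, reduced by a single max().
    static int solveFlat(int limit) {
        return IntStream.range(1, limit)
                .flatMap(i -> IntStream.range(1, limit)
                        .map(j -> digitSum(BigInteger.valueOf(i).pow(j))))
                .max()
                .orElse(0); // empty range: same result as the loop's initial max
    }

    public static void main(String[] args) {
        System.out.println(solveLoop(100));
        System.out.println(solveFlat(100));
    }
}
```

This removes the inner max().getAsInt() calls and handles an empty range without throwing. Whether it is any faster is a separate question that would need measuring.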

Answer 2

The problem you have is that you are looking at sub-optimal code. When you have code which might be heavily optimised, you are very dependent on whether the JVM is smart enough to optimise your code. Loops have been around much longer and are better understood.

One big difference in your loop code is that your working set is very small. You are only considering one maximum digit sum at a time. This means the code is cache friendly and you have very short-lived objects. In the stream() case you are building up collections, so there is more in the working set at any one time, using more cache, with more overhead. I would expect your GC times to be longer and/or more frequent as well.

Why is the stream variant so much slower than the old one?

Loops are fairly well optimised, having been around since before Java was developed. They can be mapped very efficiently to hardware. Streams are fairly new and not as heavily optimised.

Why did the parallel() statement actually increase the time from 0.19s to 0.25s?

Most likely you have a bottleneck on a shared resource. You create quite a bit of garbage, but this is usually fairly concurrent. Using more threads only guarantees that you will have more overhead; it doesn't ensure you can take advantage of the extra CPU power you have.
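When reasoning about how many threads a parallel stream actually uses, it can help to print the relevant numbers. Parallel streams normally run their tasks on the common ForkJoinPool, whose size is derived from the number of available processors unless the java.util.concurrent.ForkJoinPool.common.parallelism system property overrides it. The sketch below only reads those values; the class and method names are assumptions made for illustration.

```java
import java.util.concurrent.ForkJoinPool;

// Hypothetical class name, used only for this sketch.
class ParallelismSketch {

    // Number of processors the JVM reports as available.
    static int availableCores() {
        return Runtime.getRuntime().availableProcessors();
    }

    // Target parallelism of the common pool that parallel streams use by default.
    static int commonPoolParallelism() {
        return ForkJoinPool.getCommonPoolParallelism();
    }

    public static void main(String[] args) {
        System.out.println("available processors:    " + availableCores());
        System.out.println("common pool parallelism: " + commonPoolParallelism());
    }
}
```

Since the pool is already sized to the machine, splitting the work by hand into one chunk per core is unlikely to change much.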

Answer 1: score 22, accepted, 2014-02-23 13:50:43
Answer 2: score 3, 2014-02-23 13:45:17
