Why is this parallel matrix addition so inefficient?

Question

I'm very new to multithreading and don't have much experience using inner classes.

The task is to add two matrices containing double values in a parallelized way.

My idea was to do this recursively: split the big matrices into smaller ones, perform the addition once the matrices reach a certain size limit, then fuse the results.

The parallelized code runs 40-80x slower than the serial code.

I suspect that I'm doing something wrong here. Perhaps it's because I create so many new matrices, or because I traverse them so many times.

Here is the code:

package concurrency;

import java.util.Random;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ParallelMatrixAddition {
public static void main(String[] args) {

    Random rand = new Random();

    final int SIZE = 1000;
    double[][] one = new double[SIZE][SIZE];
    double[][] two = new double[SIZE][SIZE];
    double[][] serialSums = new double[SIZE][SIZE];
    double[][] parallelSums = new double[SIZE][SIZE];

    for (int i = 0; i < one.length; i++) {
        for (int j = 0; j < one.length; j++) {
            one[i][j] = rand.nextDouble();
            two[i][j] = rand.nextDouble();
        }
    }

    long serialStartTime = System.currentTimeMillis();

    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++) {
            serialSums[i][j] = one[i][j] + two[i][j];
        }
    }

    long serialEndTime = System.currentTimeMillis();

    System.out.println("Serial runtime is: " + (serialEndTime - serialStartTime) + " milliseconds");

    long startTime = System.currentTimeMillis();

    parallelSums = parallelAddMatrix(one, two);

    long endTime = System.currentTimeMillis();

    System.out.println("Parallel execution took " + (endTime - startTime) + " milliseconds.");

}

public static double[][] parallelAddMatrix(double[][] a, double[][] b) {
    RecursiveTask<double[][]> task = new SumMatricesTask(a, b);
    ForkJoinPool pool = new ForkJoinPool();
    double[][] result = new double[a.length][a.length];
    result = pool.invoke(task);
    return result;
}

@SuppressWarnings("serial")
private static class SumMatricesTask extends RecursiveTask<double[][]> {
    private final static int THRESHOLD = 200;

    private double[][] sumz;
    private double[][] one;
    private double[][] two;

    public SumMatricesTask(double[][] one, double[][] two) {
        this.one = one;
        this.two = two;
        this.sumz = new double[one.length][one.length];
    }

    @Override
    public double[][] compute() {
        if (this.one.length < THRESHOLD) {
            // Compute a sum here.
            // Add the sums of the matrices and store the result in the
            // matrix we will return later.

            double[][] aStuff = new double[this.one.length][this.one.length];

            for (int i = 0; i < one.length; i++) {
                for (int j = 0; j < one.length; j++) {
                    aStuff[i][j] = this.one[i][j] + this.two[i][j];
                }
            }

            return aStuff;

        } else {

            // Split a matrix into four smaller submatrices.
            // Create four forks, then four joins.

            int currentSize = this.one.length;

            int newSize = currentSize / 2;

            double[][] topLeftA = new double[newSize][newSize];
            double[][] topLeftB = new double[newSize][newSize];
            double[][] topLeftSums = new double[newSize][newSize];

            double[][] topRightA = new double[newSize][newSize];
            double[][] topRightB = new double[newSize][newSize];
            double[][] topRightSums = new double[newSize][newSize];

            double[][] bottomLeftA = new double[newSize][newSize];
            double[][] bottomLeftB = new double[newSize][newSize];
            double[][] bottomLeftSums = new double[newSize][newSize];

            double[][] bottomRightA = new double[newSize][newSize];
            double[][] bottomRightB = new double[newSize][newSize];
            double[][] bottomRightSums = new double[newSize][newSize];

            // Populate topLeftA and topLeftB
            for (int i = 0; i < newSize; i++) {
                for (int j = 0; j < newSize; j++) {
                    topLeftA[i][j] = this.one[i][j];
                    topLeftB[i][j] = this.two[i][j];
                }
            }

            // Populate bottomLeftA and bottomLeftB

            for (int i = 0; i < newSize; i++) {
                for (int j = 0; j < newSize; j++) {
                    bottomLeftA[i][j] = this.one[i + newSize][j];
                    bottomLeftB[i][j] = this.two[i + newSize][j];
                }
            }

            // Populate topRightA and topRightB

            for (int i = 0; i < newSize; i++) {
                for (int j = 0; j < newSize; j++) {
                    topRightA[i][j] = this.one[i][j + newSize];
                    topRightB[i][j] = this.two[i][j + newSize];
                }
            }

            // Populate bottomRightA and bottomRightB

            for (int i = 0; i < newSize; i++) {
                for (int j = 0; j < newSize; j++) {
                    bottomRightA[i][j] = this.one[i + newSize][j + newSize];
                    bottomRightB[i][j] = this.two[i + newSize][j + newSize];
                }
            }

            SumMatricesTask topLeft = new SumMatricesTask(topLeftA, topLeftB);
            SumMatricesTask topRight = new SumMatricesTask(topRightA, topRightB);
            SumMatricesTask bottomLeft = new SumMatricesTask(bottomLeftA, bottomLeftB);
            SumMatricesTask bottomRight = new SumMatricesTask(bottomRightA, bottomRightB);

            topLeft.fork();
            topRight.fork();
            bottomLeft.fork();
            bottomRight.fork();

            topLeftSums = topLeft.join();
            topRightSums = topRight.join();
            bottomLeftSums = bottomLeft.join();
            bottomRightSums = bottomRight.join();

            // Fuse the four matrices into one and return it.

            for (int i = 0; i < newSize; i++) {
                for (int j = 0; j < newSize; j++) {
                    this.sumz[i][j] = topLeftSums[i][j];
                }
            }

            for (int i = newSize; i < newSize * 2; i++) {
                for (int j = 0; j < newSize; j++) {
                    this.sumz[i][j] = bottomLeftSums[i - newSize][j];
                }
            }

            for (int i = 0; i < newSize; i++) {
                for (int j = newSize; j < newSize * 2; j++) {
                    this.sumz[i][j] = topRightSums[i][j - newSize];
                }
            }

            for (int i = newSize; i < newSize * 2; i++) {
                for (int j = newSize; j < newSize * 2; j++) {
                    this.sumz[i][j] = bottomRightSums[i - newSize][j - newSize];
                }
            }

            return this.sumz;
        }
    }
}

}

Thankful for any help.

Answer 1

Creating an object is many times slower than performing a +, even on a double.

This means creating an object is not a good trade-off for an addition. To make matters worse, using more memory means your CPU caches don't work as efficiently. In the worst case, data that used to fit in your L1/L2 caches now sits in the L3 cache, which is shared and does not scale as well, or, even worse, you end up going to main memory.
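To put a rough number on that trade-off, here is a small sketch that counts how many doubles the splitting scheme in the question allocates for a 1000 x 1000 input. It mirrors the structure of SumMatricesTask (one sumz array per task, twelve quadrant arrays per split, one aStuff array per leaf); the class and method names are made up for this illustration, and array headers and the unused array in parallelAddMatrix are not counted.

```java
/** Counts the doubles allocated by the recursive splitting scheme in the question. */
final class AllocationCount {
    static final int THRESHOLD = 200; // same cut-off as SumMatricesTask

    /** Doubles allocated by one task of side n, including all of its subtasks. */
    static long doublesAllocated(int n) {
        long own = (long) n * n;             // sumz, allocated in the constructor
        if (n < THRESHOLD) {
            return own + (long) n * n;       // aStuff in the base case
        }
        int half = n / 2;
        long quadrants = 12L * half * half;  // 4 quadrants x (A copy, B copy, sums)
        return own + quadrants + 4 * doublesAllocated(half);
    }

    public static void main(String[] args) {
        int size = 1000;
        long allocated = doublesAllocated(size);
        long additions = (long) size * size;
        System.out.println("additions needed:  " + additions);
        System.out.println("doubles allocated: " + allocated);
        System.out.println("approx. megabytes: " + allocated * 8 / 1_000_000);
    }
}
```

For SIZE = 1000 this comes to 14,000,000 doubles (about 112 MB) allocated, and mostly copied as well, in order to perform 1,000,000 additions.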

I suggest you rewrite this so that:

you don't create any objects.
you take into account that working along a cache line is more efficient than breaking it up, i.e. break up the work by rows, not columns.
you work on a 1D array, which can be more efficient in Java, so think about how you might do this with a 1D array that only appears as a 2D matrix.
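The three suggestions above could be combined roughly as follows. This is only a sketch under assumptions, not the answerer's code: the matrices are stored row by row in flat double[] arrays, the work is split by row ranges, and the names FlatMatrixAdd, AddRows and the cut-off ROWS_PER_TASK are invented for the example. It still creates small task objects and one output array, but no intermediate matrices and no copies.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

/** Sketch: matrix addition on flat row-major arrays, split by row ranges. */
final class FlatMatrixAdd {
    // Hypothetical cut-off: a task covering at most this many rows adds them directly.
    static final int ROWS_PER_TASK = 64;

    /** Adds rows [rowLo, rowHi) of a and b into out; allocates no arrays. */
    @SuppressWarnings("serial")
    static final class AddRows extends RecursiveAction {
        private final double[] a, b, out;
        private final int cols, rowLo, rowHi;

        AddRows(double[] a, double[] b, double[] out, int cols, int rowLo, int rowHi) {
            this.a = a; this.b = b; this.out = out;
            this.cols = cols; this.rowLo = rowLo; this.rowHi = rowHi;
        }

        @Override
        protected void compute() {
            if (rowHi - rowLo <= ROWS_PER_TASK) {
                // Element (i, j) lives at index i * cols + j,
                // so a range of rows is one contiguous run of the array.
                int end = rowHi * cols;
                for (int k = rowLo * cols; k < end; k++) {
                    out[k] = a[k] + b[k];
                }
                return;
            }
            int mid = (rowLo + rowHi) >>> 1;
            invokeAll(new AddRows(a, b, out, cols, rowLo, mid),
                      new AddRows(a, b, out, cols, mid, rowHi));
        }
    }

    /** Returns a + b for two rows x cols matrices stored row by row. */
    static double[] add(double[] a, double[] b, int rows, int cols) {
        double[] out = new double[rows * cols]; // the only array allocated
        ForkJoinPool.commonPool().invoke(new AddRows(a, b, out, cols, 0, rows));
        return out;
    }

    public static void main(String[] args) {
        int rows = 1000, cols = 1000;
        double[] a = new double[rows * cols];
        double[] b = new double[rows * cols];
        for (int k = 0; k < a.length; k++) {
            a[k] = k;
            b[k] = 2.0 * k;
        }
        double[] sum = add(a, b, rows, cols);
        System.out.println("element (1, 2) = " + sum[1 * cols + 2]);
    }
}
```

Whether this beats the plain serial loop depends on the machine: a single addition per element is so cheap that memory bandwidth, not CPU time, may well be the limit.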

Answer 2

You create new arrays all the time, which is expensive. Why not compute in the existing arrays? You can just give each thread the borders of the region it works on.
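One way to read "borders for each thread" is sketched below, keeping the double[][] layout from the question. Each worker gets a half-open row range [from, to) and writes straight into a single shared result matrix, so nothing is copied or fused. The class name BorderedAdd and the threadCount parameter are assumptions made for this example; plain threads are used here instead of the ForkJoinPool from the question to keep the borders explicit.

```java
/** Sketch: each thread adds its own band of rows directly into a shared result. */
final class BorderedAdd {

    static double[][] add(double[][] a, double[][] b, int threadCount) {
        int rows = a.length;
        int cols = rows == 0 ? 0 : a[0].length;
        double[][] sum = new double[rows][cols]; // allocated once, up front

        Thread[] workers = new Thread[threadCount];
        for (int t = 0; t < threadCount; t++) {
            // Borders of this worker's band: rows [from, to).
            final int from = (int) ((long) rows * t / threadCount);
            final int to = (int) ((long) rows * (t + 1) / threadCount);
            workers[t] = new Thread(() -> {
                for (int i = from; i < to; i++) {
                    double[] rowA = a[i], rowB = b[i], rowSum = sum[i];
                    for (int j = 0; j < rowSum.length; j++) {
                        rowSum[j] = rowA[j] + rowB[j];
                    }
                }
            });
            workers[t].start();
        }

        try {
            for (Thread worker : workers) {
                worker.join(); // join also makes the worker's writes visible here
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while adding matrices", e);
        }
        return sum;
    }

    public static void main(String[] args) {
        int size = 1000;
        double[][] one = new double[size][size];
        double[][] two = new double[size][size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                one[i][j] = i;
                two[i][j] = j;
            }
        }
        int threads = Runtime.getRuntime().availableProcessors();
        double[][] sums = add(one, two, threads);
        System.out.println("sums[3][4] = " + sums[3][4]);
    }
}
```

The bands never overlap, so the workers need no locking; the only synchronization is the final join.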

Answer 1
1 2015-12-03 12:56:25

Answer 2
0 2015-12-03 12:55:19