
Matrix multiplication, KIJ order, parallel version slower than non-parallel

I have a school task about parallel programming and I'm having a lot of problems with it. My task is to create a parallel version of the given matrix multiplication code and test its performance (and yes, it has to be in KIJ order):

void multiply_matrices_KIJ()
{
    for (int k = 0; k < SIZE; k++)
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}

This is what I came up with so far:

void multiply_matrices_KIJ()
{
    for (int k = 0; k < SIZE; k++)
#pragma omp parallel
    {
#pragma omp for schedule(static, 16)
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}

And that's where I found something confusing. This parallel version of the code runs around 50% slower than the non-parallel one. The difference in speed varies only a little with the matrix size (tested SIZE = 128, 256, 512, 1024, 2048, and various schedule clauses so far - dynamic, static, without one at all, etc.).

Can someone help me understand what I am doing wrong? Is it maybe because I'm using the KIJ order and it won't get any faster with OpenMP?

EDIT:

I'm working on a Windows 7 PC, using Visual Studio 2015 Community edition, compiling in Release x86 mode (x64 doesn't help either). My CPU is an Intel Core i5-2520M @ 2.50 GHz (yes, it's a laptop, but I'm getting the same results on my home i7 PC).

I'm using global arrays:

float matrix_a[SIZE][SIZE];    
float matrix_b[SIZE][SIZE];    
float matrix_r[SIZE][SIZE];

I'm assigning random float values to matrices a and b; matrix r is filled with 0s.
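For reference, the initialization could look roughly like this (a minimal sketch of what is described above; the value range and the use of rand() are assumptions, not taken from the original code):

#include <cstdlib>   // rand, RAND_MAX

void init_matrices()
{
    for (int i = 0; i < SIZE; i++)
        for (int j = 0; j < SIZE; j++)
        {
            // random floats in [0, 1] for the inputs, zeros for the result
            matrix_a[i][j] = static_cast<float>(rand()) / RAND_MAX;
            matrix_b[i][j] = static_cast<float>(rand()) / RAND_MAX;
            matrix_r[i][j] = 0.0f;
        }
}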

I've tested the code with various matrix sizes so far (128, 256, 512, 1024, 2048, etc.). Some of them are deliberately chosen NOT to fit in cache (at SIZE = 2048, for instance, each float matrix occupies 2048 × 2048 × 4 B = 16 MB, far more than the few MB of L3 cache). My current version of the code looks like this:

void multiply_matrices_KIJ()
{
#pragma omp parallel
    {
        for (int k = 0; k < SIZE; k++) {
#pragma omp for schedule(dynamic, 16) nowait
            for (int i = 0; i < SIZE; i++) {
                for (int j = 0; j < SIZE; j++) {
                    matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
                }
            }
        }
    }
}

And just to be clear, I know that with a different ordering of the loops I could get better results, but that is the thing - I HAVE TO use the KIJ order. My task is to parallelize the KIJ loops and check the performance increase. My problem is that I expect(ed) at least slightly faster execution than what I'm getting now (5-10% faster at most), even though it's the I loop that runs in parallel (I can't parallelize the K loop, because that would give incorrect results: multiple threads would update the same matrix_r[i][j]).

These are the results I'm getting with the code shown above (I run the calculation hundreds of times and take the average time; a sketch of such a timing harness is shown after the results):

SIZE = 128

  • Serial version: 0.000608s
  • Parallel I, schedule(dynamic, 16): 0.000683s
  • Parallel I, schedule(static, 16): 0.000647s
  • Parallel J, no schedule: 0.001978s (this is where I expected way slower execution)

SIZE = 256

  • Serial version: 0.005787s
  • Parallel I, schedule(dynamic, 16): 0.005125s
  • Parallel I, schedule(static, 16): 0.004938s
  • Parallel J, no schedule: 0.013916s

SIZE = 1024

  • Serial version: 0.930250s
  • Parallel I, schedule(dynamic, 16): 0.865750s
  • Parallel I, schedule(static, 16): 0.823750s
  • Parallel J, no schedule: 1.137000s
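For completeness, the averaging mentioned above could be done roughly like this (a sketch, not the original measurement code; it assumes omp_get_wtime() is used for wall-clock timing and a hypothetical helper time_kij_average()):

#include <omp.h>   // omp_get_wtime

double time_kij_average(int repeats)
{
    double total = 0.0;
    for (int r = 0; r < repeats; r++)
    {
        // matrix_r must be cleared before every run,
        // otherwise results from previous runs accumulate
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] = 0.0f;

        double start = omp_get_wtime();
        multiply_matrices_KIJ();
        total += omp_get_wtime() - start;
    }
    return total / repeats;   // average time per multiplication, in seconds
}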

Note: This answer is not about how to get the best performance out of your loop order or how to parallelize it, because I consider it to be suboptimal for several reasons. I'll try to give some advice on how to improve the order (and how to parallelize it) instead.

Loop order

OpenMP is usually used to distribute work over several CPUs. Therefore, you want to maximize the workload of each thread while minimizing the amount of required data and information transfer.

  1. You want to execute the outermost loop in parallel instead of the second one. Therefore, you'll want to have one of the r_matrix indices as the outer loop index in order to avoid race conditions when writing to the result matrix.

  2. The next thing is that you want to traverse the matrices in memory storage order (having the faster-changing index as the second, not the first, subscript).

You can achieve both with the following loop/index order:

for i = 0 to a_rows
  for k = 0 to a_cols
    for j = 0 to b_cols
      r[i][j] += a[i][k]*b[k][j]

Where

  • j changes faster than i or k, and k changes faster than i.
  • i is a result matrix subscript, and the i loop can run in parallel.

Rearranging your multiply_matrices_KIJ in that way gives quite a bit of a performance boost already.

I did some short tests, and the code I used to compare the timings is:

template<class T>
void mm_kij(T const * const matrix_a, std::size_t const a_rows, 
  std::size_t const a_cols, T const * const matrix_b, std::size_t const b_rows, 
  std::size_t const b_cols, T * const matrix_r)
{
  for (std::size_t k = 0; k < a_cols; k++)
  {
    for (std::size_t i = 0; i < a_rows; i++)
    {
      for (std::size_t j = 0; j < b_cols; j++)
      {
        matrix_r[i*b_cols + j] += 
          matrix_a[i*a_cols + k] * matrix_b[k*b_cols + j];
      }
    }
  }
}

mimicking your multiply_matrices_KIJ() function versus

template<class T>
void mm_opt(T const * const a_matrix, std::size_t const a_rows, 
  std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows, 
  std::size_t const b_cols, T * const r_matrix)
{
  for (std::size_t i = 0; i < a_rows; ++i)
  { 
    T * const r_row_p = r_matrix + i*b_cols;
    for (std::size_t k = 0; k < a_cols; ++k)
    { 
      auto const a_val = a_matrix[i*a_cols + k];
      T const * const b_row_p = b_matrix + k * b_cols;
      for (std::size_t j = 0; j < b_cols; ++j)
      { 
        r_row_p[j] += a_val * b_row_p[j];
      }
    }
  }
}

implementing the above-mentioned order.

Time consumption for multiplication of two 2048x2048 matrices on an Intel i5-2500k

  • mm_kij() : 6.16706s.

  • mm_opt() : 2.6567s.

The given order also allows for outer loop parallelization without introducing any race conditions when writing to the result matrix:

template<class T>
void mm_opt_par(T const * const a_matrix, std::size_t const a_rows, 
  std::size_t const a_cols, T const * const b_matrix, std::size_t const b_rows, 
  std::size_t const b_cols, T * const r_matrix)
{
#if defined(_OPENMP)
  #pragma omp parallel
  {
    auto ar = static_cast<std::ptrdiff_t>(a_rows);
    #pragma omp for schedule(static) nowait
    for (std::ptrdiff_t i = 0; i < ar; ++i)
#else
    for (std::size_t i = 0; i < a_rows; ++i)
#endif
    {
      T * const r_row_p = r_matrix + i*b_cols;
      for (std::size_t k = 0; k < b_rows; ++k)
      {
        auto const a_val = a_matrix[i*a_cols + k];
        T const * const b_row_p = b_matrix + k * b_cols;
        for (std::size_t j = 0; j < b_cols; ++j)
        {
          r_row_p[j] += a_val * b_row_p[j];
        }
      }
    }
#if defined(_OPENMP)
  }
#endif
}

where each thread writes to an individual result row.
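For reference, calling it with the global arrays from the question could look like this (a sketch; it relies on the 2D arrays being contiguous row-major float storage, and on matrix_r having been zeroed beforehand since the function accumulates with +=):

// Hypothetical call site using the globals from the question
mm_opt_par(&matrix_a[0][0], SIZE, SIZE,
           &matrix_b[0][0], SIZE, SIZE,
           &matrix_r[0][0]);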

Time consumption for multiplication of two 2048x2048 matrices on an Intel i5-2500k (4 OMP threads)

  • mm_kij() : 6.16706s.

  • mm_opt() : 2.6567s.

  • mm_opt_par() : 0.968325s.

Not perfect scaling, but as a start it is faster than the serial code (2.6567 s / 0.968325 s ≈ 2.7x on 4 threads).

OpenMP implementations create a thread pool (although a thread pool is not mandated by the OpenMP standard, every implementation of OpenMP I have seen does this) so that threads don't have to be created and destroyed each time a parallel region is entered. Nevertheless, there is a barrier between parallel regions, so all threads have to synchronize there. There is probably some additional overhead in the fork-join model between parallel regions. So even though the threads don't have to be recreated, they still have to be initialized between parallel regions. More details can be found here.

In order to avoid the overhead of entering parallel regions repeatedly, I suggest creating the parallel region on the outermost loop but doing the work sharing on the inner loop over i, like this:

void multiply_matrices_KIJ() {
    #pragma omp parallel
    for (int k = 0; k < SIZE; k++)
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
}

There is an implicit barrier at the end of a #pragma omp for loop. The nowait clause removes that barrier.

Also make sure you compile with optimizations enabled. There is little point in comparing performance without optimization. I would use -O3 (with MSVC, the closest equivalent is /O2, and OpenMP support is enabled with /openmp).

Always keep in mind that, for caching purposes, the most optimal ordering of your loops runs from slowest-changing index (outermost) to fastest-changing index (innermost). In your case, that means I,K,J order. I would be quite surprised if your serial code is not automatically reordered from KIJ to IKJ ordering by your compiler (assuming you have "-O3"). However, the compiler cannot do this with your parallel loop, because that would break the logic you are declaring within your parallel region.

If you really, truly cannot reorder your loops, then your best bet would probably be to rewrite the parallel region to encompass the largest possible loop. If you have OpenMP 4.0, you could also consider utilizing SIMD vectorization across your fastest dimension. However, I am still doubtful you will be able to beat your serial code by much, because of the aforementioned caching issues inherent in the KIJ ordering...

void multiply_matrices_KIJ()
{
    #pragma omp parallel for
    for (int k = 0; k < SIZE; k++)
    {
        for (int i = 0; i < SIZE; i++)
            #pragma omp simd
            for (int j = 0; j < SIZE; j++)
                matrix_r[i][j] += matrix_a[i][k] * matrix_b[k][j];
    }
}
