在C ++中提高矩阵（多维数组）的OpenMP多线程并行计算效率

Question

I just started to use OpenMP to do parallel computing in C++. 我刚刚开始使用OpenMP在C ++中进行并行计算。 The program has a bad parallel performance. 该程序的并行性能很差。 Since I don't know many multi-threading profiling tool (unlike simple gprof for single thread), I wrote a sample program to test the performance. 由于我不了解许多多线程分析工具（与单线程的简单gprof不同），我编写了一个示例程序来测试性能。

I have a 2D matrix(N * N), with each element a 3d vector(x, y, z). 我有一个2D矩阵（N * N），每个元素一个3d向量（x，y，z）。 I simply do a double for loop to set each value in the matrix: 我只是简单地执行double for循环来设置矩阵中的每个值：

for (int i = 0; i < N; ++i) {
  for (int j = 0; j < N; ++j) {
    vectorStack[i][j] = VECTOR3D(1.0*i*i, 1.0*j*j, 1.0*i*j);
  }
}

where VECTOR3D is a simple class has x, y, z attributes: 其中VECTOR3D是一个简单的类，具有x, y, z属性：

class VECTOR3D {
    double x, y, z;                // component along each axis 
}

On the other hand, I can also use a (N * N * 3) 3D array to do this: 另一方面，我也可以使用（N * N * 3）3D数组来执行此操作：

for (int i = 0; i < N; ++i) {
  for (int j = 0; j < N; ++j) {
    arrayHeap[i][j][0] = (1.0*i*i);
    arrayHeap[i][j][1] = (1.0*j*j);
    arrayHeap[i][j][2] = (1.0*i*j);
  }
}

From the memory aspect, there are also several different choices, such as use raw pointers with manually allocate and deallocate: 从内存方面来看，还有几种不同的选择，例如使用带有手动分配和取消分配的原始指针：

  double ***arrayHeap;
  arrayHeap = new double** [N];
  for(int i = 0; i < N; ++i) {
    arrayHeap[i] = new double* [N];
    for(int j = 0; j < N; ++j) {
      arrayHeap[i][j] = new double[3];
    }
  }

or simply use std::vector : 或者只是使用std::vector ：

  vector< vector<VECTOR3D>> vectorStack(N, vector<VECTOR3D>(N, VECTOR3D(0, 0, 0)));

I also considered to manually allocate continuous memory for arrays as did in LAMMP(Molecular Simulation Package source code. 我还考虑过像LAMMP（分子模拟软件包源代码）那样，为阵列手动分配连续内存。

So my results for N=10000 are listed here: 因此，我的N=10000结果在这里列出：

For single thread: 对于单线程：

OMP_NUM_THREADS=1 ./a.out OMP_NUM_THREADS = 1 ./a.out

Allocating memory for array on heap... 为堆上的数组分配内存...

======= Array on heap Results ======= =======堆结果数组=======

Finished within time (total): 0.720385 seconds 完成时间（总计）： 0.720385秒

Finished within time (real): 0.720463 seconds 在时间（真实）内完成： 0.720463秒

Deallocating memory for array on heap... 正在为堆上的数组分配内存...

Allocating memory for array continous... 为数组连续分配内存...

======= Array continuous Results ======= =======数组连续结果=======

Finished within time (total): 0.819733 seconds 在规定时间内完成（总计）： 0.819733秒

Finished within time (real): 0.819774 seconds 在时间（真实）内完成： 0.819774秒

Deallocating memory for array continuous... 为数组连续分配内存...

Allocating memory for vector on heap... 为堆上的向量分配内存...

======= Vector on heap Results ======= =======堆上的向量=======

Finished within time (total): 3.08715 seconds 在规定时间内完成（总计）： 3.08715秒

Finished within time (real): 3.08725 seconds 在规定时间内（真实）完成： 3.08725秒

Deallocating memory for vector on heap... 正在为堆上的向量释放内存...

Allocating memory for vector on stack... 为堆栈上的向量分配内存...

======= Vector on stack Results ======= =======堆栈上的向量结果=======

Finished within time (total): 1.49888 seconds 在规定时间内完成（总计）： 1.49888秒

Finished within time (real): 1.49895 seconds 在时间（真实）内完成： 1.49895秒

For multi-threads (threads=4): 对于多线程（线程= 4）：

OMP_NUM_THREADS=4 ./a.out OMP_NUM_THREADS = 4 ./a.out

Allocating memory for array on heap... 为堆上的数组分配内存...

======= Array on heap Results ======= =======堆结果数组=======

Finished within time (total): 2.29184 seconds 在规定时间内完成（总计）： 2.29184秒

Finished within time (real): 0.577807 seconds 在时间（真实）内完成： 0.577807秒

Deallocating memory for array on heap... 正在为堆上的数组分配内存...

Allocating memory for array continous... 为数组连续分配内存...

======= Array continuous Results ======= =======数组连续结果=======

Finished within time (total): 1.79501 seconds 在时间范围内完成（总计）： 1.79501秒

Finished within time (real): 0.454139 seconds 在时间（真实）内完成： 0.454139秒

Deallocating memory for array continuous... 为数组连续分配内存...

Allocating memory for vector on heap... 为堆上的向量分配内存...

======= Vector on heap Results ======= =======堆上的向量=======

Finished within time (total): 6.80917 seconds 在时间范围内完成（总计）： 6.80917秒

Finished within time (real): 1.92541 seconds 在时间（真实）内完成： 1.92541秒

Deallocating memory for vector on heap... 正在为堆上的向量释放内存...

Allocating memory for vector on stack... 为堆栈上的向量分配内存...

======= Vector on stack Results ======= =======堆栈上的向量结果=======

Finished within time (total): 1.64086 seconds 在规定时间内完成（总计）： 1.64086秒

Finished within time (real): 0.411 seconds 在时间（真实）内完成： 0.411秒

The overall parallel efficiency is not good. 总体并行效率不好。 Unexpected, the fancy continuous memory allocating is not helpful?! 出乎意料的是，花哨的连续内存分配没有帮助吗？ Why is this happening? 为什么会这样呢？ It seems the std::vector is good enough for this case? 在这种情况下，似乎std::vector足够好了吗？

Could someone explain the results for me? 有人可以为我解释结果吗？ and I also need suggestions on how to improve the code. 而且我还需要有关如何改进代码的建议。

Thanks very much!!! 非常感谢！！！

Attached all the source code. 附上所有源代码。 (please directly go to main, there are several functions to manually manage the memory at the beginning). （请直接转到main，开始时有几个功能可以手动管理内存）。

#include <iostream>
#include <omp.h>
#include <vector>
#include <stdlib.h>
#include <cinttypes>
#include "vector3d.h"

typedef int64_t bigint;

void *smalloc(bigint nbytes, const char *name)
{
  if (nbytes == 0) return NULL;
  void *ptr = malloc(nbytes);
  return ptr;
}

template <typename TYPE>
TYPE ***create(TYPE ***&array, int n1, int n2, int n3, const char *name)
{
  bigint nbytes = ((bigint) sizeof(TYPE)) * n1*n2*n3;
  TYPE *data = (TYPE *) smalloc(nbytes,name);
  nbytes = ((bigint) sizeof(TYPE *)) * n1*n2;
  TYPE **plane = (TYPE **) smalloc(nbytes,name);
  nbytes = ((bigint) sizeof(TYPE **)) * n1;
  array = (TYPE ***) smalloc(nbytes,name);

  int i,j;
  bigint m;
  bigint n = 0;
  for (i = 0; i < n1; i++) {
    m = ((bigint) i) * n2;
    array[i] = &plane[m];
    for (j = 0; j < n2; j++) {
      plane[m+j] = &data[n];
      n += n3;
    }
  }
  return array;
}

template <typename TYPE>
TYPE ***create3d_offset(TYPE ***&array, int n1lo, int n1hi,
                        int n2, int n3, const char *name)
{
  int n1 = n1hi - n1lo + 1;
  create(array,n1,n2,n3,name);
  array -= n1lo;
  return array;
}

void sfree(void *ptr) {
  if (ptr == NULL) return;
  free(ptr);
}


template <typename TYPE>
void destroy(TYPE ***&array)
{
  if (array == NULL) return;
  sfree(array[0][0]);
  sfree(array[0]);
  sfree(array);
  array = NULL;
}

template <typename TYPE>
void destroy3d_offset(TYPE ***&array, int offset)
{
  if (array == NULL) return;
  sfree(&array[offset][0][0]);
  sfree(&array[offset][0]);
  sfree(&array[offset]);
  array = NULL;
}


////////////////////////////////////////////////////////
////////////////////////////////////////////////////////
////////////////////////////////////////////////////////

int main() {
  using namespace std;

  const int N = 10000;


  ///////////////////////////////////////

  double sum = 0.0;
  clock_t t;
  double startTime, stopTime, secsElapsed;


  printf("\n\nAllocating memory for array on heap...\n");
  double ***arrayHeap;
  arrayHeap = new double** [N];
  for(int i = 0; i < N; ++i) {
    arrayHeap[i] = new double* [N];
    for(int j = 0; j < N; ++j) {
      arrayHeap[i][j] = new double[3];
    }
  }


  printf("======= Array on heap Results =======\n");

  sum = 0.0;
  t = clock();
  startTime = omp_get_wtime();

#pragma omp parallel
  {
//#pragma omp for schedule(dynamic)
//#pragma omp for collapse(2)
#pragma omp for
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
        arrayHeap[i][j][0] = (1.0*i*i);
        arrayHeap[i][j][1] = (1.0*j*j);
        arrayHeap[i][j][2] = (1.0*i*j);
      }
    }

  }

  t = clock() - t;
  cout << "Finished within time (total): " << ((double) t) / CLOCKS_PER_SEC << " seconds" << endl;
  stopTime = omp_get_wtime();
  secsElapsed = stopTime - startTime;
  cout << "Finished within time (real): " << secsElapsed << " seconds" << endl;

  printf("Deallocating memory for array on heap...\n");
  for(int i = 0; i < N; ++i) {
    for(int j = 0; j < N; ++j) {
      delete [] arrayHeap[i][j];
    }
    delete [] arrayHeap[i];
  }
  delete [] arrayHeap;


  ///////////////////////////////////////


  printf("\n\nAllocating memory for array continous...\n");
  double ***array_continuous;
  create3d_offset(array_continuous,0, N, N, 3, "array");

  printf("======= Array continuous Results =======\n");

  sum = 0.0;
  t = clock();
  startTime = omp_get_wtime();

#pragma omp parallel
  {
//#pragma omp for schedule(dynamic)
//#pragma omp for collapse(2)
#pragma omp for
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
        array_continuous[i][j][0] = (1.0*i*i);
        array_continuous[i][j][1] = (1.0*j*j);
        array_continuous[i][j][2] = (1.0*i*j);
      }
    }

  }

  t = clock() - t;
  cout << "Finished within time (total): " << ((double) t) / CLOCKS_PER_SEC << " seconds" << endl;
  stopTime = omp_get_wtime();
  secsElapsed = stopTime - startTime;
  cout << "Finished within time (real): " << secsElapsed << " seconds" << endl;


  printf("Deallocating memory for array continuous...\n");
  destroy3d_offset(array_continuous, 0);



///////////////////////////////////////k



  printf("\n\nAllocating memory for vector on heap...\n");
  VECTOR3D ***vectorHeap;
  vectorHeap = new VECTOR3D**[N];
  for(int i = 0; i < N; ++i) {
    vectorHeap[i] = new VECTOR3D* [N];
    for(int j = 0; j < N; ++j) {
      vectorHeap[i][j] = new VECTOR3D(0,0,0);
    }
  }

  printf("======= Vector on heap Results =======\n");
  sum = 0.0;
  t = clock();
  startTime = omp_get_wtime();

#pragma omp parallel
  {
//#pragma omp for schedule(dynamic)
//#pragma omp for collapse(2)
#pragma omp for
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
        vectorHeap[i][j] = new VECTOR3D(1.0*i*i, 1.0*j*j, 1.0*i*j);
      }
    }

  }

  t = clock() - t;
  cout << "Finished within time (total): " << ((double) t) / CLOCKS_PER_SEC << " seconds" << endl;
  stopTime = omp_get_wtime();
  secsElapsed = stopTime - startTime;
  cout << "Finished within time (real): " << secsElapsed << " seconds" << endl;

  printf("Deallocating memory for vector on heap...\n");
  for(int i = 0; i < N; ++i) {
    for(int j = 0; j < N; ++j) {
      delete [] vectorHeap[i][j];
    }
    delete [] vectorHeap[i];
  }
  delete [] vectorHeap;


  /////////////////////////////////////////////////

  printf("\n\nAllocating memory for vector on stack...\n");
  vector< vector<VECTOR3D>> vectorStack(N, vector<VECTOR3D>(N, VECTOR3D(0, 0, 0)));

  printf("======= Vector on stack Results =======\n");
  sum = 0.0;
  t = clock();
  startTime = omp_get_wtime();

#pragma omp parallel
  {
//#pragma omp for schedule(dynamic)
//#pragma omp for collapse(2)
#pragma omp for
    for (int i = 0; i < N; ++i) {
      for (int j = 0; j < N; ++j) {
        vectorStack[i][j] = VECTOR3D(1.0*i*i, 1.0*j*j, 1.0*i*j);
      }
    }

  }

  t = clock() - t;
  cout << "Finished within time (total): " << ((double) t) / CLOCKS_PER_SEC << " seconds" << endl;
  stopTime = omp_get_wtime();
  secsElapsed = stopTime - startTime;
  cout << "Finished within time (real): " << secsElapsed << " seconds" << endl;


  /////////////////////////////////



  return 0;
}

And the VECTOR3D class: 和VECTOR3D类：

#ifndef _VECTOR3D_H
#define _VECTOR3D_H

#include <iostream>
#include <cmath>
#include <iomanip>
#include <limits>

class VECTOR3D {
public:
  double x, y, z;                // component along each axis (cartesian)
  VECTOR3D(double xx = 0.0, double yy = 0.0, double zz = 0.0) : x(xx), y(yy), z(zz)    // make a 3d vector
  {
  }
}

Answer 1

General Misconception 一般误解

Your trivial loop is not compute bound, but entirely memory bound: You access each element only once. 琐碎的循环不是受计算限制的，而是受内存限制的：您只能访问每个元素一次。 No re-use means that you cannot efficiently use caches. 没有重用意味着您不能有效地使用缓存。 Therefore you can not expect a speedup equal to the number of used threads/cores. 因此，您不能期望加速等于使用的线程/核心数。 The actual speedup depends on the specific system (memory bandwidth). 实际的加速取决于特定的系统（内存带宽）。

Indirection 间接的

All of your data structures, including the fancy continuous memory perform many indirections on the data access. 您的所有数据结构，包括精美的连续存储器 ，都对数据访问执行许多间接操作。 This is not strictly necessary. 这不是严格必要的。 To gain full advantage of the continuous memory, you should simply lay out your 2d array flat: 为了充分利用连续内存的优势，您应该简单地将2d数组平面布局：

template<class T>
class Array2d
{
public:
    Array2d(size_t rows, size_t columns) : rows_(rows), columns_(columns), data_(rows_ * columns_) {}
    T& operator()(size_t r, size_t c)
    {
        return data_[r * columns_ + c];
    }

    const T& operator()(size_t r, size_t c) const
    {
        return data_[r * columns_ + c];
    }

private:
    size_t rows_;
    size_t columns_;
    std::vector<T> data_;
};

Note: You could also make a fancy operator[] that returns a proxy object providing another operator[] if you really must retain the [i][j] indexing. 注意：如果确实必须保留[i][j]索引，则还可以创建一个精美的operator[]来返回提供另一个operator[]的代理对象。

Clarification 澄清度

If you are limited by memory bandwidth and N is large enough, there will be no noticeable performance difference between indirection or flat layout. 如果您受内存带宽的限制并且N足够大，则间接寻址或平面布局之间不会存在明显的性能差异。

在C ++中提高矩阵（多维数组）的OpenMP多线程并行计算效率

问题描述

1 个解决方案

解决方案1
1 2016-11-20 09:05:10

General Misconception 一般误解

Indirection 间接的

Clarification 澄清度

在C ++中提高矩阵（多维数组）的OpenMP多线程并行计算效率

问题描述

1 个解决方案

解决方案1 1 2016-11-20 09:05:10

General Misconception 一般误解

Indirection 间接的

Clarification 澄清度

解决方案1
1 2016-11-20 09:05:10