
Using OpenMP to parallelize a loop over a vector of C++ objects

I'm trying to increase the performance of a C++ code using OpenMP, but I'm not seeing very good scaling. Before delving into the details of my code, I have a very general question that I think could save a lot of time if I can get a definitive answer to it.

The basic structure of the code is a vector of objects (let's say size num_objs = 5000) where each object holds a relatively small vector of doubles (let's say size num_elems = 500). I want to loop through this vector of objects and, for each object, perform a subloop over the member vector to modify each element. I am only attempting to parallelize the outer loop (over the objects), as this is the standard approach with OpenMP and this loop is much larger than the nested one.

So now for my question. Am I taking a severe performance hit by looping over the array of objects and then looping over each of their smaller member vectors? Should I expect a significant increase in performance if I instead made one large vector of size num_objs * num_elems and then did a parallel loop over "chunks" of that big vector corresponding to the member vectors stored in each object described above? That way both the outer loop and the inner loop would be accessing data from one big vector rather than having to fetch data from separate objects.

The actual code is much more complicated than the above description suggests, so trying this alternative approach would require a lot of time spent on modifications. Therefore, I just wanted to get a feel for how significant a speedup I could get if I spent the time restructuring the entire code. I don't have a lot of knowledge about computer architecture, memory access, caches, etc., so apologies if this is painfully obvious.

EDIT: I was thinking there was possibly a simple answer to this; however, I see that's not really the case. Please consider the following (simplified example).

#include <cmath>
#include <ctime>
#include <iostream>
#include <omp.h>
#include <string>
#include <vector>

class Block {
public:
  static double a;
  std::vector<double> x;
  std::vector<double> y;
  Block(int N);
};

double Block::a = 5;

int main(int argc, char const *argv[]) {
  int num_blocks = 80000;
  int num_elems = 1000;
  int num_iter = 100;

  int nthreads = 1;
  bool parallel_on = true;

  omp_set_num_threads(nthreads);

  std::vector<Block> block_vec;

  for (int i = 0; i < num_blocks; i++) {
    block_vec.push_back(Block(num_elems));
  }

  double start;
  double end;
  start = omp_get_wtime();

  int iter = 0;

  while (iter < num_iter) {
#pragma omp parallel for if (parallel_on)
    for (int bl = 0; bl < num_blocks; bl++) {
      for (int i = 0; i < num_elems; i++) {
        block_vec[bl].x[i] = Block::a * block_vec[bl].y[i] + block_vec[bl].x[i];
      }
    }
    iter++;
    std::cout << "ITER: " << iter << std::endl;
  }

  end = omp_get_wtime();
  double time_taken = end - start;
  std::cout << "TIME: " << time_taken << std::endl;

  return 0;
}

Block::Block(int N) {
  x.assign(N, 2.0);
  y.assign(N, 3.0);
}

I compile this program with:

g++ -fopenmp -O3 saxpy.cpp

I'm running it on an i7-6700 CPU @ 3.40GHz (four physical cores and eight logical cores). Here is the computational time for differing thread counts:

1 THREAD: 8.65s
2 THREAD: 7.37s
3 THREAD: 7.41s
4 THREAD: 7.65s

I did try a version of this code, as I described above, that makes use of one big vector rather than the nested loop; however, the result was about the same, actually a little slower.

The speed of your program mainly depends on the speed of memory reads/writes (including cache utilization, etc.). Depending on the hardware, you may or may not observe a speedup. For more details, please read, e.g., this.

On my laptop (i7-8550U, g++ -fopenmp -O3 -mavx2 saxpy.cpp) I got a similar result, but on a Xeon server I got a significant speed improvement:

nthreads=1     
TIME: 13.0372
real    0m14.303s
user    0m13.206s
sys     0m1.096s

nthreads=4
TIME: 5.1537
real    0m5.921s
user    0m18.473s
sys     0m0.615s

nthreads=8
TIME: 3.43479
real    0m4.237s
user    0m27.337s
sys     0m0.608s

