
Assigning to std::vector<std::vector<double>> in parallel

I have some serial code that does a matrix-vector multiply, with the matrix and the vector represented as std::vector<std::vector<double>> and std::vector<double>, respectively:

void mat_vec_mult(const std::vector<std::vector<double>> &mat, const std::vector<double> &vec,
                  std::vector<std::vector<double>> *result, size_t beg, size_t end) {
  //  multiply a matrix by a pre-transposed column vector; returns a column vector
  for (auto i = beg; i < end; i++) {
    (*result)[i] = {std::inner_product(mat[i].begin(), mat[i].end(), vec.begin(), 0.0)};
  }
}

I would like to parallelize it using OpenMP, which I am trying to learn. From here, I got to the following:

void mat_vec_mult_parallel(const std::vector<std::vector<double>> &mat, const std::vector<double> &vec,
                           std::vector<std::vector<double>> *result, size_t beg, size_t end) {
  //  multiply a matrix by a pre-transposed column vector; returns a column vector
  #pragma omp parallel
  {
    #pragma omp for nowait
    for (auto i = beg; i < end; i++) {
      (*result)[i] = {std::inner_product(mat[i].begin(), mat[i].end(), vec.begin(), 0.0)};
    }
  }
}

This approach has not resulted in any speedup; I would appreciate any help in choosing the correct OpenMP directives.

There are several things that could explain why you are not seeing a performance improvement. The most likely ones are these:

  1. You didn't enable OpenMP support at the compiler level. From the comments, this seems not to be the case, so it can be ruled out for you; I mention it anyway because it is such a common mistake that it is worth the reminder.
  2. The way you measure your time: beware of CPU time vs. elapsed time. See this answer, for example, for how to properly measure elapsed time, since that is the time you want to see decreasing.
  3. The fact that your code is memory bound: matrix-matrix multiplication is normally the kind of code that shines at exploiting CPU power, but that doesn't happen by magic. The code has to be tuned towards that goal, and one of the first tuning techniques to apply is tiling / cache blocking, whose aim is to maximize data (re)use while it sits in cache, instead of fetching it again from main memory. From what I can see in your code, the algorithm does exactly the opposite: it streams data from memory for processing, completely ignoring any reuse potential. So you are memory bound, and in this case, sorry, but OpenMP can't help you much. See this answer for example to see why.
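Regarding point 2, a minimal sketch of timing the question's loop with a wall-clock (elapsed-time) timer; the function name `timed_mat_vec_mult` is my own. I use `std::chrono::steady_clock` here since it is standard C++; with OpenMP enabled, `omp_get_wtime()` from `<omp.h>` serves the same purpose:

```cpp
#include <chrono>
#include <numeric>
#include <vector>

// Runs the question's parallel loop and returns the elapsed (wall-clock)
// time in seconds. steady_clock measures elapsed time, not CPU time, so
// it should shrink as threads are added if the loop actually scales.
double timed_mat_vec_mult(const std::vector<std::vector<double>> &mat,
                          const std::vector<double> &vec,
                          std::vector<std::vector<double>> *result) {
  auto t0 = std::chrono::steady_clock::now();
  #pragma omp parallel for
  for (long i = 0; i < (long)mat.size(); i++) {
    (*result)[i] = {std::inner_product(mat[i].begin(), mat[i].end(),
                                       vec.begin(), 0.0)};
  }
  auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}
```

By contrast, `std::clock()` sums CPU time across all threads, so it can even appear to *increase* when you parallelize.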

These are not the only reasons that could explain a lack of scalability, but with the limited information given, I think they are the most likely culprits.
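As an aside on the memory-bound point: the `std::vector<std::vector<double>>` layout scatters rows across separate heap allocations. A sketch of an alternative (my own suggestion, not from the question) that stores the matrix row-major in one contiguous `std::vector<double>`, so each row streams through the prefetcher and the result is a plain vector rather than one-element column vectors:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical flat-layout variant: mat holds n_rows * n_cols doubles in
// row-major order. Each thread handles a contiguous block of rows
// (schedule(static)), and rows are contiguous in memory.
std::vector<double> mat_vec_mult_flat(const std::vector<double> &mat,
                                      const std::vector<double> &vec,
                                      std::size_t n_rows, std::size_t n_cols) {
  std::vector<double> result(n_rows, 0.0);
  #pragma omp parallel for schedule(static)
  for (long i = 0; i < (long)n_rows; i++) {
    const double *row = &mat[(std::size_t)i * n_cols];  // start of row i
    double sum = 0.0;
    for (std::size_t j = 0; j < n_cols; j++)
      sum += row[j] * vec[j];
    result[i] = sum;
  }
  return result;
}
```

This doesn't change the arithmetic intensity (each matrix element is still read once), so it won't make the kernel compute bound, but it removes the per-row pointer chasing and allocation overhead of the nested-vector layout.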
