
Assigning to std::vector<std::vector<double>> in parallel

I have some serial code that does a matrix-vector multiply, with the matrix and the vector represented as std::vector<std::vector<double>> and std::vector<double> respectively:

void mat_vec_mult(const std::vector<std::vector<double>> &mat, const std::vector<double> &vec,
                  std::vector<std::vector<double>> *result, size_t beg, size_t end) {
  //  multiply a matrix by a pre-transposed column vector; returns a column vector
  for (auto i = beg; i < end; i++) {
    (*result)[i] = {std::inner_product(mat[i].begin(), mat[i].end(), vec.begin(), 0.0)};
  }
}

I would like to parallelize it using OpenMP, which I am trying to learn. Starting from here, I got to the following:

void mat_vec_mult_parallel(const std::vector<std::vector<double>> &mat, const std::vector<double> &vec,
                  std::vector<std::vector<double>> *result, size_t beg, size_t end) {
  //  multiply a matrix by a pre-transposed column vector; returns a column vector
    #pragma omp parallel
    {
        #pragma omp for nowait
          for (auto i = beg; i < end; i++) {
            (*result)[i] = {std::inner_product(mat[i].begin(), mat[i].end(), vec.begin(), 0.0)};
          }
    }
}
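As an aside (not part of the original post): the `nowait` clause buys nothing here, because the only thing following the loop is the implicit barrier at the end of the parallel region. A minimal equivalent sketch using the combined `parallel for` directive (assuming OpenMP ≥ 3.0, which allows unsigned loop counters such as `size_t`):

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Equivalent sketch: the two pragmas collapsed into one combined directive.
// Without -fopenmp the pragma is ignored and the loop simply runs serially.
void mat_vec_mult_parallel2(const std::vector<std::vector<double>> &mat,
                            const std::vector<double> &vec,
                            std::vector<std::vector<double>> *result,
                            size_t beg, size_t end) {
  #pragma omp parallel for
  for (size_t i = beg; i < end; i++) {
    (*result)[i] = {std::inner_product(mat[i].begin(), mat[i].end(),
                                       vec.begin(), 0.0)};
  }
}
```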

This approach has not resulted in any speedup. I would appreciate any help in choosing the correct OpenMP directives.

There are several things that could explain why you see no performance improvement. The most likely ones are these:

  1. You didn't enable OpenMP support at the compiler level. Well, from the comments, this seems not to be the case, so it can be ruled out for you. I'm still mentioning it because it is such a common mistake that it's worth the reminder that this is needed.
  2. The way you measure your time: beware of CPU time vs. elapsed time. See this answer for example to see how to properly measure elapsed time, as this is the time you want to see decreasing.
  3. The fact that your code is memory bound: normally, matrix multiplication is the type of code that shines at exploiting CPU power. However, that doesn't happen by magic; the code has to be tuned towards that goal, and one of the first tuning techniques to apply is tiling / cache blocking. The aim is to maximize data (re)use while it sits in cache memory, instead of repeatedly fetching it from central memory. From what I can see in your code, the algorithm does exactly the opposite: it streams data from memory for processing, completely ignoring its reuse potential. So you're memory bound, and in this case, sorry, but OpenMP can't help you much. See this answer for example to see why.

These are not the only reasons that could explain a lack of scalability, but with the limited information you give, I think they are the most likely culprits.
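To make point 2 concrete, here is a hypothetical helper (the name `time_both` and its shape are my own, not from the post) that reports both figures for a workload. `std::clock()` accumulates CPU time across *all* threads, so it stays roughly constant (or even grows) as you add threads; only the wall-clock figure should shrink when OpenMP helps. If you compile with OpenMP enabled, `omp_get_wtime()` is an equally good wall-clock source.

```cpp
#include <chrono>
#include <ctime>
#include <utility>

// Times a callable and returns {CPU seconds, wall-clock seconds}.
// CPU seconds come from std::clock() (summed over all threads);
// wall-clock seconds come from a monotonic clock.
template <class F>
std::pair<double, double> time_both(F &&work) {
  std::clock_t c0 = std::clock();
  auto w0 = std::chrono::steady_clock::now();
  work();  // the code under measurement
  auto w1 = std::chrono::steady_clock::now();
  std::clock_t c1 = std::clock();
  return {double(c1 - c0) / CLOCKS_PER_SEC,
          std::chrono::duration<double>(w1 - w0).count()};
}
```

Compare the two returned numbers before and after parallelization: a successful parallel run shows the wall-clock time dropping while the CPU time stays about the same.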

