
OpenMP matrix multiplication takes more time than expected

I am writing an OpenMP program to multiply two matrices. The idea is that each thread calculates part of each cell's result. Afterwards, I add up those partial results for each cell to get the result of the multiplication.

The problem is that the program takes a lot of time when I use large matrices (512x512 or 1024x1024). Indeed, with a 1024x1024 matrix and 5 threads it took 43 seconds, while with 1 thread it took 14 seconds.

I think it might be the critical section causing the huge delays.

Here is the code:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int ** make_array(int n,int m,int f)
{
    int i,j;
    int *linear, **arr;
    linear = malloc(sizeof(int)*m*n);
    arr = malloc(sizeof(int *)*n);
    for(i = 0;i<n;++i) arr[i] = &linear[i*m];
    if(f == 0)
    {
        for(i=0;i<n;++i)
        for(j=0;j<m;++j) arr[i][j] = 0;
        return arr;
    }
    for(i=0;i<n;++i)
        for(j=0;j<m;++j) arr[i][j] = 1+i;
    return arr;
}

void printMat(int **mat, int n)
{
    int i,j;
    for(i = 0; i < n; ++i)
    {
        for(j = 0; j < n;++j)
        {
            printf("%d ",mat[i][j]);
        }
        printf("\n");
    }
}

int main (int argc, char *argv[])
{
    int n;            /// matrix dimension


    scanf("%d", &n);
    double TIME = 0;
    int **a,**b,**c;
    a = make_array(n,n,1);
    b = make_array(n,n,1);
    c = make_array(n,n,0);

    int i,j,k;

    #pragma omp parallel private(i,j,k) shared(a,b,c,TIME)
    {
        double start = omp_get_wtime();
        int **local;
        local = make_array(n,n,0);
        for(i = 0; i < n; ++i)
        {
            for(j = 0; j <n; ++j)
            {
                local[i][j] = 0;
                #pragma omp for schedule(static)
                for(k = 0; k < n; ++k)
                {
                    local[i][j]+= a[i][k] * b[k][j];
                }
            }
        }
        for(i = 0; i <n;++i)
        {
            for(j = 0; j < n; ++j)
            {
                #pragma omp critical
                c[i][j] += local[i][j];
            }
        }
        double end = omp_get_wtime();
        if(TIME < end - start)
        {
            #pragma omp critical
            TIME = end - start;
        }
    }

    printf("%f \n", TIME);
}

Any help would be much appreciated.

This code has a lot of problems.

The parallelization method is very inefficient:

For each possible i and j, you share a very small amount of work among multiple threads. Moreover, there is an implicit barrier at the end of the parallel for loop. Thus, the communication between threads probably takes much more time than the actual computation.

Critical sections are usually slow (they are generally implemented using locks). Here you can replace the critical section with an atomic operation.

With k threads, the code needs k times more memory and is likely to be memory bound (because of the caches and the additional data to be filled, not to mention the extra page faults, which are expensive nowadays).
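The atomic replacement mentioned above could look roughly like the following. This is a minimal sketch, assuming each thread has already accumulated its partial sums into a private flat buffer; the name add_partial and the flat layout are illustrative assumptions, not part of the original code.

```c
#include <stddef.h>

/* Hypothetical helper: add one thread's partial sums into the shared result.
   Meant to be called by every thread from inside a parallel region. */
void add_partial(int *c, const int *local, size_t count)
{
    for (size_t idx = 0; idx < count; ++idx) {
        /* atomic protects only this single update, which is usually
           cheaper than a lock-based critical section. */
        #pragma omp atomic
        c[idx] += local[idx];
    }
}
```

Even with atomics, every thread still touches every cell of c, so this only reduces the synchronization cost; it does not remove the extra memory traffic.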

As a result, you need to rework the parallelization method. You could, for example, move the #pragma omp for schedule(static) to the i-based loop. Alternatively, you can divide the matrix into chunks and share the work between threads.
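Parallelizing over the i-based loop might be sketched as follows. It assumes square matrices stored as flat row-major arrays rather than the question's int ** layout, and the function name matmul_rows is hypothetical.

```c
/* Hypothetical sketch: c = a * b for n x n matrices in flat row-major storage. */
void matmul_rows(const int *a, const int *b, int *c, int n)
{
    /* Rows of c are split among the threads. Each cell is written by exactly
       one thread, so no critical section, atomic, or per-thread copy of the
       matrix is needed. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j)
            c[i * n + j] = 0;
        /* i-k-j loop order reads b and writes c row by row. */
        for (int k = 0; k < n; ++k) {
            const int aik = a[i * n + k];
            for (int j = 0; j < n; ++j)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

With this layout there is a single implicit barrier at the end of the whole loop instead of one per (i, j) pair. Compile with OpenMP enabled (for example -fopenmp with GCC); without it the pragma is ignored and the function runs serially.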

Please use a BLAS library to do matrix multiplications. Such libraries are much more optimized than this code.

Here is a list of some other problems:

  • There are memory leaks (there are calls to malloc but none to free).
  • There is a race condition on the TIME-based condition: the comparison TIME < end - start is evaluated outside the critical section.
  • Jagged arrays may be inefficient here compared to flat arrays.
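For the memory leak, one possible fix is a release function that mirrors the question's allocation scheme (one block for the data, one for the row pointers). This is only a sketch: make_zero_array is a trimmed, zero-filling variant of make_array, and free_array is a hypothetical helper that does not appear in the original code.

```c
#include <stdlib.h>

/* Allocate an n x m zero-filled matrix (n > 0, m > 0 assumed):
   one contiguous data block plus an array of row pointers into it. */
int **make_zero_array(int n, int m)
{
    int *linear = calloc((size_t)n * (size_t)m, sizeof *linear);
    int **arr = malloc((size_t)n * sizeof *arr);
    if (linear == NULL || arr == NULL) {
        free(linear);
        free(arr);
        return NULL;
    }
    for (int i = 0; i < n; ++i)
        arr[i] = &linear[(size_t)i * (size_t)m];
    return arr;
}

/* Release both allocations: arr[0] points at the start of the data block. */
void free_array(int **arr)
{
    if (arr != NULL) {
        free(arr[0]);
        free(arr);
    }
}
```

Calling free_array on a, b, c, and each thread's local matrix before they go out of scope would remove the leaks. Because the data block is contiguous, arr[0] can also be passed directly to code that expects a flat array.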
