
OpenMP basic temperature modelling with parallel processing taking longer than serial code

I am new to OpenMP and I have to complete this project where I need to solve a 2D heat-conduction problem on a matrix using the Jacobi iteration method, parallelised with OpenMP.

Essentially, it is a plate whose four side walls are held at fixed temperatures, and I need to work out the unknown temperature values in the middle.

The code has been given to us, and what I am expected to do is three simple things:

  1. Time the serial code
  2. Parallelise the serial code and compare
  3. Further optimise the parallel code if possible

I have run the serial code and parallelised it to make a comparison. Before any optimisation, for some reason the serial code is consistently doing better. I can't help but think I am doing something wrong.

I will try compiler optimisations for both, but I expected the parallel code to be faster. I have chosen large problem sizes for the matrix, including 100 x 100 and 300 x 300 arrays, and almost every single time the serial code does better.

Funny thing is, the more threads I add, the slower it gets.

I understand that for a small problem size the overhead would be relatively larger, but I thought this was a large enough problem size?

This is before any significant optimisation. Am I doing something obviously wrong that makes it behave like this?

Here is the code:

Serial code:

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>

int main(int argc, char *argv[])
{

    int m; 
    int n;
    double tol;// = 0.0001;

    int i, j, iter;

    m = atoi(argv[1]);
    n = atoi(argv[2]);
    tol = atof(argv[3]);

    /**
     * @var t, tnew, diff, difmax
     * t is the old temperature array, tnew is the new array
     */
    double t[m+2][n+2], tnew[m+1][n+1], diff, difmax;

    /**
     * Timer variables
     * @var start, end 
     */
    double start, end;

    printf("%d %d %lf\n",m,n, tol);

    start = omp_get_wtime();

    // initialise temperature array
    for (i=0; i <= m+1; i++) {
        for (j=0; j <= n+1; j++) {
            t[i][j] = 30.0;
        }
    }

    // fix boundary conditions
    for (i=1; i <= m; i++) {
        t[i][0] = 33.0;
        t[i][n+1] = 42.0;
    }
    for (j=1; j <= n; j++) {
        t[0][j] = 20.0;
        t[m+1][j] = 25.0;
    }

    // main loop
    iter = 0;
    difmax = 1000000.0;
    while (difmax > tol) {

        iter++;

        // update temperature for next iteration
        for (i=1; i <= m; i++) {
            for (j=1; j <= n; j++) {
                tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
            }
        }

        // work out maximum difference between old and new temperatures
        difmax = 0.0;
        for (i=1; i <= m; i++) {
            for (j=1; j <= n; j++) {
                diff = fabs(tnew[i][j]-t[i][j]);
                if (diff > difmax) {
                    difmax = diff;
                }
                // copy new to old temperatures
                t[i][j] = tnew[i][j];
            }
        }

    }

    end = omp_get_wtime();

    // print results
    // (the loop that prints the temperatures is commented out to save time)
    printf("iter = %d  difmax = %9.11lf\n", iter, difmax);
    printf("Time in seconds: %lf \n", end - start);
    // for (i=0; i <= m+1; i++) {
    //  printf("\n");
    //  for (j=0; j <= n+1; j++) {
    //      printf("%3.5lf ", t[i][j]);
    //  }
    // }
    // printf("\n");

}

Here is the parallel code:

#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <omp.h>



int main(int argc, char *argv[])
{
    int m; 
    int n;
    double tol;// = 0.0001;

    /**
     * @brief Integer variables
     * @var i external loop (y column array) counter,
     * @var j internal loop (x row array counter) counter,
     * @var iter number of iterations,
     * @var numThreads number of threads
     */
    int i, j, iter, numThreads;

    m = atoi(argv[1]);
    n = atoi(argv[2]);
    tol = atof(argv[3]);
    numThreads = atoi(argv[4]);

    /**
     * @brief Double variables
     * @var t, tnew the temperature arrays: t holds the old values, tnew the new values,
     * @var diff the per-cell difference,
     * @var diffmax the maximum difference across the whole grid,
     * @var privDiffmax the per-thread maximum difference
     */
    double t[m+2][n+2], tnew[m+1][n+1], diff, diffmax, privDiffmax;

    /**
     * Timer variables
     * @var start, end 
     */
    double start, end;

    /**
     * @brief Print the problem size & the tolerance
     * This print statement can be there as it is not part of the parallel region
     * We also print the number of threads when printing the problem size & tolerance
     */
    //printf("%d %d %lf %d\n",m,n, tol, numThreads);
    omp_set_num_threads(numThreads);

    /**
     * @brief Initialise the timer
     * 
     */
    start = omp_get_wtime();

    /**
     * @brief Creating the parallel region:
     * Here both loop counters are private:
     */
    #pragma omp parallel private(i, j)
    {
        /**
         * @brief initialise temperature array
         * This can be in a parallel region by itself
         */
        #pragma omp for collapse(2) schedule(static)
        for (i=0; i <= m+1; i++) {
            for (j=0; j <= n+1; j++) {
                t[i][j] = 30.0;
            }
        }

        // fix boundary conditions
        #pragma omp for schedule(static)
        for (i=1; i <= m; i++) {
            t[i][0] = 33.0;
            t[i][n+1] = 42.0;
        }

        #pragma omp for schedule(static)
        for (j=1; j <= n; j++) {
            t[0][j] = 20.0;
            t[m+1][j] = 25.0;
        }

    }   

    // main loop
    iter = 0;
    diffmax = 1000000.0;

    while (diffmax > tol) {

        iter = iter + 1;

        /**
         * @brief update temperature for next iteration
         * Here we have created a parallel for directive, this is the second parallel region
         */
        #pragma omp parallel for private(i, j) collapse(2) schedule(static)
        for (i=1; i <= m; i++) {
            for (j=1; j <= n; j++) {
                tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
            }
        }

        // work out maximum difference between old and new temperatures
        diffmax = 0.0;
        
        /**
         * @brief Third parallel region that compares the difference
         */
        #pragma omp parallel private(i, j, privDiffmax, diff)
        {
            privDiffmax = 0.0;
            #pragma omp for collapse(2) schedule(static)
            for (i=1; i <= m; i++) {
                for (j=1; j <= n; j++) {
                    diff = fabs(tnew[i][j]-t[i][j]);
                    if (diff > privDiffmax) {
                        privDiffmax = diff;
                    }
                    // copy new to old temperatures
                    t[i][j] = tnew[i][j];
                }
            }
            #pragma omp critical
            if (privDiffmax > diffmax)
            {
                diffmax = privDiffmax;
            }
        }
        

    }

    //Add timer for the end
    end = omp_get_wtime();

    // print results
    // (the loop that prints the temperatures is commented out to save time)
    printf("iter = %d  diffmax = %9.11lf\n", iter, diffmax);
    printf("Time in seconds: %lf \n", end - start);
    // for (i=0; i <= m+1; i++) {
    //  printf("\n");
    //  for (j=0; j <= n+1; j++) {
    //      printf("%3.5lf ", t[i][j]);
    //  }
    // }
    // printf("\n");

}

Here are some of the benchmarks for the serial and parallel code:

Serial code (screenshot)

Parallel code - 2 threads (screenshot)

Parallel code - 4 threads (screenshot)

Parallel code - 6 threads (screenshot)

I have run the code and tested it. I have commented out the print statements as I don't need to see that output except for testing. The code runs fine, but somehow it is slower than the serial code.

I have an 8-core Apple M1 Mac.

I am new to OpenMP and can't help but think I am missing something. Any advice would be appreciated.

The problem comes from the overhead of collapse(2) on Clang. I can reproduce the problem with Clang 13.0.1 in both -O0 and -O2 on an x86-64 i5-9600KF processor, but not with GCC 11.2.0. Clang generates inefficient code when collapse(2) is used: it emits an expensive div / idiv instruction in the hot loop in order to compute the i and j indices. Indeed, here is the assembly code of the hot loop of the sequential version (in -O1 to make the code more compact):

.LBB0_27:                               #   Parent Loop BB0_15 Depth=1
        movsd   xmm3, qword ptr [rbx + 8*rsi]   # xmm3 = mem[0],zero
        movapd  xmm5, xmm3
        subsd   xmm5, qword ptr [rdi + 8*rsi]
        andpd   xmm5, xmm0
        maxsd   xmm5, xmm2
        movsd   qword ptr [rdi + 8*rsi], xmm3
        add     rsi, 1
        movapd  xmm2, xmm5
        cmp     r12, rsi
        jne     .LBB0_27

Here is the parallel counterpart (still in -O1):

.LBB3_4:
        mov     rax, rcx
        cqo
        idiv    r12                     # <-------------------
        shl     rax, 32
        add     rax, rdi
        sar     rax, 32
        mov     rbp, rax
        imul    rbp, r13                # <-------------------
        shl     rdx, 32
        add     rdx, rdi
        sar     rdx, 32
        add     rbp, rdx
        movsd   xmm2, qword ptr [r9 + 8*rbp]    # xmm2 = mem[0],zero
        imul    rax, r8                # <-------------------
        add     rax, rdx
        movapd  xmm3, xmm2
        subsd   xmm3, qword ptr [rsi + 8*rax]
        andpd   xmm3, xmm0
        maxsd   xmm3, xmm1
        movsd   qword ptr [rsi + 8*rax], xmm2
        add     rcx, 1
        movapd  xmm1, xmm3
        cmp     rbx, rcx
        jne     .LBB3_4

There are many more instructions to execute because the loop spends most of its time computing indices. You can fix this by not using the collapse clause (see the sketch below). Theoretically, it should be better to provide more parallelism to compilers and runtimes and let them make the best decisions, but in practice they are not optimal and often need to be assisted/guided. Note that GCC uses a more efficient approach that consists in computing the division only once per block, so compilers can do this optimization.
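For instance, the main update loop of the question's parallel code can be written without collapse(2) as follows; this is a minimal sketch that distributes only the outer loop, so the inner index remains a simple increment with no division:

// Only the i loop is split across threads; the j loop keeps unit-stride indexing.
#pragma omp parallel for private(j) schedule(static)
for (i = 1; i <= m; i++) {
    for (j = 1; j <= n; j++) {
        tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
    }
}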


Results

With `collapse(2)`:
- Sequential:  0.221358 seconds
- Parallel:    0.274861 seconds

Without:
- Sequential:  0.222201 seconds
- Parallel:    0.055710 seconds

Additional notes on performance

For better performance, consider using -O2 or even -O3. Also consider using -march=native. -ffast-math can also help if you do not use special floating-point (FP) values like NaN and you do not care about FP associativity. Not copying the array every iteration and using a double-buffering method instead also helps a lot, since memory-bound codes do not scale well (see the sketch below). Then consider reading a research paper for better performance (trapezoidal tiling can be used to boost the performance even more, but it is quite complex to implement). Also note that not using collapse(2) reduces the amount of parallelism, which might be a problem on a processor with a lot of cores, but in practice having a lot of cores operating on a small array tends to be slow anyway (because of false sharing and communication).
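To illustrate the double-buffering idea (this is assumed helper code, not part of the original answer), here is a minimal sketch that stores the grids as heap-allocated flat arrays so the old and new buffers can be swapped by exchanging pointers instead of copying every cell; it reuses the question's variable names and replaces the critical section with an OpenMP 3.1 max reduction:

/* Both buffers must carry the fixed boundary values, because the
 * pointers are swapped after every sweep instead of copying back. */
double *t    = malloc((size_t)(m + 2) * (n + 2) * sizeof *t);
double *tnew = malloc((size_t)(m + 2) * (n + 2) * sizeof *tnew);
#define IDX(i, j) ((size_t)(i) * (n + 2) + (j))

/* ... initialise the interior and the boundaries of BOTH t and tnew ... */

iter = 0;
difmax = 1000000.0;
while (difmax > tol) {
    iter++;
    difmax = 0.0;
    /* One fused sweep: compute the new values and the maximum difference. */
    #pragma omp parallel for private(j) reduction(max:difmax) schedule(static)
    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            tnew[IDX(i, j)] = (t[IDX(i-1, j)] + t[IDX(i+1, j)]
                             + t[IDX(i, j-1)] + t[IDX(i, j+1)]) / 4.0;
            double diff = fabs(tnew[IDX(i, j)] - t[IDX(i, j)]);
            if (diff > difmax) difmax = diff;
        }
    }
    double *tmp = t; t = tnew; tnew = tmp;   /* swap buffers, no copy */
}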

Special note for M1 processors

M1 processors are based on a big/little architecture. Such an architecture is good at making sequential codes faster thanks to the few "big" cores that run fast (but also consume a lot of space and energy). However, running parallel code efficiently is harder because the "little" cores (which are small and energy efficient) are much slower than the big ones, introducing a load-imbalance issue if all kinds of cores run simultaneously (IDK if this is the case on the M1 by default). One solution is to control the execution so that only the same kind of core is used. Another solution is to use dynamic scheduling so as to balance the work automatically at runtime (e.g. using the clause schedule(guided) or even schedule(dynamic), as in the sketch below). The second solution tends to add significant overhead and is known to cause other tricky issues on (NUMA-based) computing servers (or even recent AMD PC processors). It is also important to note that the scaling will not be linear with the number of threads because of the performance difference between the big and little cores. Such architectures are currently poorly supported by a lot of applications because of the above issues.
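As a simple illustration of the dynamic-scheduling option (the chunk size of 16 rows is just an arbitrary example, not a recommendation from the answer), the main update loop could look like this:

/* The runtime hands out chunks of 16 rows on demand, so faster (big)
 * cores naturally end up processing more rows than slower (little) cores. */
#pragma omp parallel for private(j) schedule(dynamic, 16)
for (i = 1; i <= m; i++) {
    for (j = 1; j <= n; j++) {
        tnew[i][j] = (t[i-1][j] + t[i+1][j] + t[i][j-1] + t[i][j+1]) / 4.0;
    }
}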
