Parallel Monte Carlo simulation runs significantly slower than the sequential version

Question

I am new to concurrent and parallel programming in general. I am trying to compute Pi in C using the Monte Carlo method. Here is my source code:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>

int main(void)
{
    long points;
    long m = 0;
    double coordinates[2];
    double distance;
    printf("Enter the number of points: ");
    scanf("%ld", &points);

    srand((unsigned long) time(NULL));
    for(long i = 0; i < points; i++)
    {
        coordinates[0] = ((double) rand() / (RAND_MAX));
        coordinates[1] = ((double) rand() / (RAND_MAX));
        distance = sqrt(pow(coordinates[0], 2) + pow(coordinates[1], 2));
        if(distance <= 1)
            m++;
    }

    printf("Pi is roughly %lf\n", (double) 4*m / (double) points);
}

When I try to parallelize this program with the OpenMP API, it runs almost 4 times slower.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{

    long total_points;              // Total number of random points which is given by the user
    volatile long total_m = 0;      // Total number of random points which are inside of the circle
    int threads = get_nprocs();     // This is needed so each thread knows how many random points it should generate
    printf("Enter the number of points: ");
    scanf("%ld", &total_points);
    omp_set_num_threads(threads);   

    #pragma omp parallel
    {
       double coordinates[2];          // Contains the x and y of each random point
       long m = 0;                     // Number of points that are in the circle for any particular thread
       long points = total_points / threads;   // Number of random points that each thread should generate
       double distance;                // Distance of the random point from the center of the circle, if greater than 1 then the point is outside of the circle
       srand((unsigned long) time(NULL));

        for(long i = 0; i < points; i++)
        {
           coordinates[0] = ((double) rand() / (RAND_MAX));    // Random x
           coordinates[1] = ((double) rand() / (RAND_MAX));    // Random y
           distance = sqrt(pow(coordinates[0], 2) + pow(coordinates[1], 2));   // Calculate the distance
          if(distance <= 1)
              m++;
       }

       #pragma omp critical
       {
           total_m += m;
       }
    }

    printf("Pi is roughly %lf\n", (double) 4*total_m / (double) total_points);
}

I tried to look up the cause, but found different answers for different algorithms.

Answer 1

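Your code has two sources of overhead: the critical region and the calls to rand(). Use rand_r instead of rand():

I think you're looking for rand_r(), which explicitly takes the current RNG state as an argument. Each thread should then have its own copy of the seed data (whether you want each thread to start with the same seed or with different ones depends on what you are doing; here you want them to be different, or you would get the same sequence again and again).

The critical region can be removed by using the OpenMP reduction clause. Moreover, you need neither the call to sqrt nor the manual division of the points among threads (i.e., long points = total_points / threads;); you can use #pragma omp for for that. Your code would then look like this: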

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{
    long total_points; 
    long total_m = 0;
    int threads = get_nprocs();   
    printf("Enter the number of points: ");
    scanf("%ld", &total_points);
    omp_set_num_threads(threads);   

    #pragma omp parallel 
    {                  
        unsigned int myseed = omp_get_thread_num();
        #pragma omp for reduction (+: total_m)
        for(long i = 0; i < total_points; i++){
            if(pow((double) rand_r(&myseed) / (RAND_MAX), 2) + pow((double) rand_r(&myseed) / (RAND_MAX), 2) <= 1)
               total_m++;
         }
     }
    printf("Pi is roughly %lf\n", (double) 4*total_m / (double) total_points);

}

A quick test on my machine with an input of 1000000000:

sequential : 16.282835 seconds 
2 threads  :  8.206498 seconds  (1.98x faster)
4 threads  :  4.107366 seconds  (3.96x faster)
8 threads  :  2.728513 seconds  (5.96x faster)

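Bear in mind that my machine has only 4 cores. Nonetheless, for a more meaningful comparison, one should optimize the sequential code as much as possible and then compare it with the parallel versions. Naturally, if the sequential version is optimized as much as possible, the speedup of the parallel version may drop. For example, testing the current parallel version against the sequential version of the code provided by @user3666197, without modifying it, yields the following results: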

sequential :  9.343118 seconds 
2 threads  :  8.206498 seconds  (1.13x faster)
4 threads  :  4.107366 seconds  (2.27x faster)
8 threads  :  2.728513 seconds  (3.42x faster)

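However, the parallel version can be improved as well, and so on. For example, taking @user3666197's version, fixing the race condition on the update of coordinates (which was shared among threads), and adding the OpenMP #pragma omp for, we get the following code: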

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{
    double start = omp_get_wtime();
    long points = 1000000000; //....................................... INPUT AVOIDED
    long m = 0;
    unsigned long HAUSNUMERO = 1;
    double DIV1byMAXbyMAX = 1. / RAND_MAX / RAND_MAX;

    int threads = get_nprocs();
    omp_set_num_threads(threads);
    #pragma omp parallel reduction (+: m )
    {
        unsigned int aThreadSpecificSEED_x = HAUSNUMERO + 1 + omp_get_thread_num();
        unsigned int aThreadSpecificSEED_y = HAUSNUMERO - 1 + omp_get_thread_num();
        #pragma omp for nowait
        for(long i = 0; i < points; i++)
        {
            double x = rand_r( &aThreadSpecificSEED_x );
            double y = rand_r( &aThreadSpecificSEED_y );
            m += (1  >= ( x * x + y * y ) * DIV1byMAXbyMAX);
        }
    }
    double end = omp_get_wtime();
    printf("%f\n",end-start);
    printf("Pi is roughly %lf\n", (double) 4*m / (double) points);
}

which yields the following results:

sequential :  9.160571 seconds 
2 threads  :  4.769141 seconds  (1.92 x faster)
4 threads  :  2.456783 seconds  (3.72 x faster)
8 threads  :  2.203758 seconds  (4.15 x faster)

I am compiling with the flags -O3 -std=c99 -fopenmp, using gcc version 4.9.3 (MacPorts gcc49 4.9.3_0).

Answer 2

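The problem you are running into is inherent to the function rand(), which is not required to be reentrant. When multiple threads enter this function, they compete to read and write its shared data in a way that is not thread-safe. This contention leads to extremely slow behavior. Instead of rand(), look for a similar function that is reentrant to get rid of this problem.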

Answer 3

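You need to replace rand() with a thread-specific random number generator that accesses only local variables. Otherwise the threads compete to synchronize the same cache line.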

Answer 4

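Adding a few cents on top of the Amdahl's Law argument

With such an extremely trivial amount of "useful" work inside the loop, AVX-512 register-level parallelism and SIMD-alignment tricks will most probably outperform any heavyweight OpenMP setup for points << 1E15+.

This answer is offered as inspiration for where the code can save a lot of cost, thanks to an analytically equivalent formulation of the problem (avoiding expensive SQRT-s and DIV-s that add no value).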
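
The equivalence this relies on can be written out as follows. This is a sketch under the assumption that X and Y denote the raw, non-negative generator outputs and R stands for RAND_MAX; these symbols are introduced here for illustration only.

```latex
\[
\sqrt{\left(\frac{X}{R}\right)^{2} + \left(\frac{Y}{R}\right)^{2}} \le 1
\;\iff\;
\frac{X^{2} + Y^{2}}{R^{2}} \le 1
\;\iff\;
\left(X^{2} + Y^{2}\right) \cdot \frac{1}{R^{2}} \le 1
\]
```

Both sides of the first inequality are non-negative, so squaring preserves it and the square root can be dropped. The factor 1/R^2 does not depend on the loop iteration, so it can be computed once before the loop, leaving two multiplications, one addition and one scaling multiplication per point.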

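The code is available on the Godbolt.org IDE for any further online experiments and profiling; a simplified version was also modified there for any further re-testing.

The timing part is left to @dreamcrash, who has a level playing field on which to re-test with meaningful comparisons: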

#include <stdio.h> //............................. -O3 -fopenmp
#include <stdlib.h>
#include <math.h>
#include <time.h>
#include <omp.h>
#include <sys/sysinfo.h>

int main(void)
{
    long points = 1000; //....................................... INPUT AVOIDED
    long m = 0;
//  double coordinates[2]; //.................................... OBVIOUS TO BE PUT IN PRIVATE PART
    unsigned long HAUSNUMERO = 1; //............................. AVOID SIN OF IRREPRODUCIBILITY
//  printf( "RAND_MAX is %ld on this platform\n", RAND_MAX );//.. 2147483647 PLATFORM SPECIFIC
    double DIV1byMAXbyMAX = 1. / RAND_MAX / RAND_MAX; //......... PRECOMPUTE A STATIC VALUE

    int threads = get_nprocs();
    omp_set_num_threads(threads);

    #pragma omp parallel reduction (+: m )
    {
    //..............................SEED.x PRINCIPALLY INDEPENDENT FOR MUTUALLY RANDOM SEQ-[x,y]
        unsigned int aThreadSpecificSEED_x = HAUSNUMERO + 1 + omp_get_thread_num();
        unsigned int aThreadSpecificSEED_y = HAUSNUMERO - 1 + omp_get_thread_num();
    //..............................SEED.y PRINCIPALLY INDEPENDENT FOR MUTUALLY RANDOM SEQ-[x,y]
        double x, y;

        for(long i = 0; i < points / threads; i++)
        {   
            x = rand_r( &aThreadSpecificSEED_x );
            y = rand_r( &aThreadSpecificSEED_y );

            if( 1  >= ( x * x //................. NO INTERIM STORAGE NEEDED
                      + y * y //................. NO SQRT EVER NEEDED
                        ) * DIV1byMAXbyMAX //.... MUL is WAY FASTER THAN DIV
                   )
            m++;
        }
    }
    printf("Pi is roughly %lf\n", (double) 4*m / (double) points);
}

4 answers

Answer 1
Score 3, accepted, 2021-01-04 10:48:10

Answer 2
Score 1, 2021-01-04 10:35:54

Answer 3
Score 0, 2021-01-04 10:33:58

Answer 4
Score 0, 2021-01-04 11:56:42
