OpenMP and GSL RNG - Performance issue - 4-thread implementation 10x slower than pure sequential one (quad-core CPU)

Question

I am trying to convert a C project of mine from sequential to parallel programming. Although most of the code has now been redesigned from scratch for this purpose, random number generation is still at its core. Thus, poor performance of the random number generator (RNG) badly affects the overall performance of the program.

I wrote a few lines of code (see below) to show the problem I am facing as concisely as possible.

The problem is the following: every time the number of threads nt increases, performance gets significantly worse. On this workstation (Linux kernel 2.6.33.4; gcc 4.4.4; Intel quad-core CPU) the parallel for-loop takes roughly 10x longer to finish with nt=4 than with nt=1, regardless of the number of iterations n.

This situation seems to be described here, but the focus there is mainly on Fortran, a language I know very little about, so I would very much appreciate some help.

I tried to follow their idea of creating a separate RNG (with a different seed) for each thread, but the performance is still very bad. Actually, using a different seed for each thread bothers me as well, because I cannot see how one can guarantee the quality of the generated numbers in the end (lack of correlations, etc.).
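
As a minimal sketch of the per-thread seeding idea described above: the helper below derives one seed per thread from a base seed by scrambling it with the SplitMix64 mixing function. The names `mix64` and `thread_seed` are hypothetical and not part of GSL. The sketch only makes the seeds distinct and nonzero; it does not by itself prove that the resulting streams are uncorrelated. Nonzero matters because GSL documents a seed of 0 as selecting the generator's default seed, and in the listing below `seed*i` is 0 for the first thread.

```c
#include <assert.h>
#include <stdint.h>

/* SplitMix64 step: an invertible scrambling of a 64-bit value. */
static uint64_t mix64(uint64_t z)
{
    z += 0x9E3779B97F4A7C15ULL;
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* Hypothetical helper (not a GSL function): derive a seed for thread
 * `tid` from a user-supplied base seed. Zero is avoided because GSL
 * treats a zero seed as "use the default seed". */
uint64_t thread_seed(uint64_t base, int tid)
{
    assert(tid >= 0);
    uint64_t s = mix64(mix64(base) + (uint64_t)tid);
    return s != 0 ? s : 1;
}
```

Under these assumptions, each generator could be seeded with something like `gsl_rng_set(rng[i], (unsigned long) thread_seed(seed, i))`. Note that some generators use only part of the seed (the cast may also narrow it), so this is a starting point rather than a guarantee of stream quality.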

I have already thought of dropping GSL altogether and implementing a random generator algorithm (such as Mersenne Twister) myself, but I suspect I would just run into the same issue later on.

Thank you very much in advance for your answers and advice. Please do ask about anything important I may have forgotten to mention.

EDIT: Implemented corrections suggested by lucas1024 (pragma for-loop declaration) and JonathanDursi (seeding; setting "a" as a private variable). Performance is still very sluggish in multithreaded mode.

EDIT 2: Implemented solution suggested by Jonathan Dursi (see comments).

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <gsl/gsl_rng.h>
    #include <omp.h>

    double d_t (struct timespec t1, struct timespec t2){

        return (t2.tv_sec-t1.tv_sec)+(double)(t2.tv_nsec-t1.tv_nsec)/1000000000.0;
    }

    int main (int argc, char *argv[]){

        double a, b;

        int i,j,k;

        int n=atoi(argv[1]), seed=atoi(argv[2]), nt=atoi(argv[3]);

        printf("\nn\t= %d", n);
        printf("\nseed\t= %d", seed);
        printf("\nnt\t= %d", nt);

        struct timespec t1, t2, t3, t4;

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t1);

        //initialize gsl random number generator
        const gsl_rng_type *rng_t;
        gsl_rng **rng;
        gsl_rng_env_setup();
        rng_t = gsl_rng_default;

        rng = (gsl_rng **) malloc(nt * sizeof(gsl_rng *));

        #pragma omp parallel for num_threads(nt)
        for(i=0;i<nt;i++){
            rng[i] = gsl_rng_alloc (rng_t);
            gsl_rng_set(rng[i],seed*i);
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t2);

        for (i=0;i<n;i++){
            a = gsl_rng_uniform(rng[0]);
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t3);

        omp_set_num_threads(nt);
        #pragma omp parallel private(j,a)
        {
            j = omp_get_thread_num();
            #pragma omp for
            for(i=0;i<n;i++){
                a = gsl_rng_uniform(rng[j]);
            }
        }

        clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &t4);

        printf("\n\ninitializing:\t\tt1 = %f seconds", d_t(t1,t2));
        printf("\nsequential for loop:\tt2 = %f seconds", d_t(t2,t3));
        printf("\nparallel for loop:\tt3 = %f seconds (%f * t2)", d_t(t3,t4), (double)d_t(t3,t4)/(double)d_t(t2,t3));
        printf("\nnumber of threads:\tnt = %d\n", nt);

        //free random number generator
        for (i=0;i<nt;i++)
            gsl_rng_free(rng[i]);
        free(rng);

        return 0;

    }

Answer 1

The problem is in the second #pragma omp line. The first #pragma omp spawns 4 threads. After that you are supposed to simply say #pragma omp for, not #pragma omp parallel for.

With the current code, depending on your OpenMP nesting settings, you are creating 4 x 4 threads that are doing the same work and accessing the same data.
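
The difference between the two forms can be made visible by counting how many loop iterations are executed in total. The sketch below is illustrative only: the function names are hypothetical, and the code also compiles without OpenMP, in which case both functions simply run the loop once. With `#pragma omp for` the n iterations are divided among the existing team; with a nested `#pragma omp parallel for` every thread of the outer team starts its own inner region and runs all n iterations again.

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Size of the innermost enclosing team (1 without OpenMP). */
static int team_size(void)
{
#ifdef _OPENMP
    return omp_get_num_threads();
#else
    return 1;
#endif
}

/* Worksharing: the team splits the n iterations among its threads. */
long count_workshared(long n, int nt)
{
    long total = 0;
    assert(nt > 0);
    #pragma omp parallel num_threads(nt) reduction(+:total)
    {
        #pragma omp for
        for (long i = 0; i < n; i++)
            total += 1;
    }
    return total;
}

/* Nested parallel for: each outer thread repeats the whole loop. */
long count_nested(long n, int nt, int *outer_threads)
{
    long total = 0;
    int threads = 1;
    assert(nt > 0);
    #pragma omp parallel num_threads(nt) reduction(+:total)
    {
        long local = 0;
        #pragma omp single
        threads = team_size();
        #pragma omp parallel for reduction(+:local)
        for (long i = 0; i < n; i++)
            local += 1;
        total += local;
    }
    *outer_threads = threads;
    return total;
}
```

`count_workshared` returns n, while `count_nested` returns n multiplied by the size of the outer team, which is the redundant work this answer describes.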

4 2012-03-29 00:11:08
