No real speedup with pthreads

Question

I am trying to implement a multithreaded version of a Monte Carlo algorithm. Here is my code:

#define _POSIX_C_SOURCE 200112L

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>
#include <math.h>
#include <semaphore.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define MAX_THREADS 12
#define MAX_DOTS 10000000

double sum = 0.0;
sem_t sem;

void reset() {
    sum = 0.0;
}

void* check_dot(void* _iterations) {
    int* iterations = (int*)_iterations;
    for(int i = 0; i < *iterations; ++i) {
        double x = (double)(rand() % 314) / 100;
        double y = (double)(rand() % 100) / 100;
        if(y <= sin(x)) {
            sem_wait(&sem);
            sum += x * y;
            sem_post(&sem);
        }
    }
    return NULL;
}

void* check_dots_advanced(void* _iterations) {
    int* iterations = (int*)_iterations;
    double* res = (double*)calloc(1, sizeof(double)); // zero-initialised; malloc would leave *res indeterminate
    for(int i = 0; i < *iterations; ++i) {
        double x = (double)(rand() % 314) / 100;
        double y = (double)(rand() % 100) / 100;
        if(y <= sin(x)) *res += x * y;
    }
    pthread_exit((void*)res);
}

double run(int threads_num, bool advanced) {
    if(!advanced) sem_init(&sem, 0, 1);
    struct timespec begin, end;
    double elapsed;
    pthread_t threads[threads_num];
    int iters = MAX_DOTS / threads_num;
    for(int i = 0; i < threads_num; ++i) {
        if(!advanced) pthread_create(&threads[i], NULL, &check_dot, (void*)&iters);
        else pthread_create(&threads[i], NULL, &check_dots_advanced, (void*)&iters);
    }
    if(clock_gettime(CLOCK_REALTIME, &begin) == -1) {
        perror("Unable to get time");
        exit(-1);
    }
    for(int i = 0; i < threads_num; ++i) {
        if(!advanced) pthread_join(threads[i], NULL);
        else {
            void* tmp;
            pthread_join(threads[i], &tmp);
            sum += *((double*)tmp);
            free(tmp);
        }
    }
    if(clock_gettime(CLOCK_REALTIME, &end) == -1) {
        perror("Unable to get time");
        exit(-1);
    }
    if(!advanced) sem_destroy(&sem);
    elapsed = end.tv_sec - begin.tv_sec;
    elapsed += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
    return elapsed;
}

int main(int argc, char** argv) {
    bool advanced = false;
    char* filename = NULL;
    for(int i = 1; i < argc; ++i) {
        if(strcmp(argv[i], "-o") == 0 && argc > i + 1) {
            filename = argv[i + 1];
            ++i;
        }
        else if(strcmp(argv[i], "-a") == 0 || strcmp(argv[i], "--advanced") == 0) {
            advanced = true;
        }
    }
    if(!filename) {
        fprintf(stderr, "You should provide the name of the output file.\n");
        exit(-1);
    }
    FILE* fd = fopen(filename, "w");
    if(!fd) {
        perror("Unable to open file");
        exit(-1);
    }
    srand(time(NULL));
    double worst_time = run(1, advanced);
    double result = (3.14 / MAX_DOTS) * sum;
    reset();
    fprintf(fd, "Result: %f\n", result); 
    for(int i = 2; i <= MAX_THREADS; ++i) {
        double time = run(i, advanced);
        double accel = time / worst_time;
        fprintf(fd, "%d:%f\n", i, accel);
        reset();
    }
    fclose(fd);
    return 0;
}

However, I can't see any real speedup as the number of threads increases, and it does not matter which of the two worker functions, check_dot() or check_dots_advanced(), I use. I ran this code on my laptop with an Intel Core i7-3517U (lscpu says that it has 4 independent CPUs), and the number of threads does not seem to reduce the execution time of my program:

Number of threads: 1, working time: 0.847277 s
Number of threads: 2, working time: 3.133838 s
Number of threads: 3, working time: 2.331216 s
Number of threads: 4, working time: 3.011819 s
Number of threads: 5, working time: 3.086003 s
Number of threads: 6, working time: 3.118296 s
Number of threads: 7, working time: 3.058180 s
Number of threads: 8, working time: 3.114867 s
Number of threads: 9, working time: 3.179515 s
Number of threads: 10, working time: 3.025266 s
Number of threads: 11, working time: 3.142141 s
Number of threads: 12, working time: 3.064318 s

I expected a roughly linear relationship between the execution time and the number of worker threads, at least for the first four values (the more threads are working, the shorter the execution time), but instead the times are nearly identical. Is there a real problem in my code, or am I expecting too much?

Answer 1

I was able to collect the timing and scaling measurements you are after by making two changes to your code.

First, rand() is not thread-safe. Replacing it with calls to rand_r(&seed) in check_dots_advanced() showed continued scaling as the number of threads increased. I think rand() might have an internal lock that serializes execution and prevents any speedup. This change alone shows some scaling, from 1.23 s down to 0.55 s (5 threads).

Second, I introduced barriers around the core execution region so that the cost of serially creating and joining the threads, and of the malloc calls, is not included in the measurement. The core execution region shows good scaling, from 1.23 s down to 0.18 s (8 threads).

The code was compiled with gcc -O3 -pthread mcp.c -std=c11 -lm and run on an Intel E3-1240 v5 (4 cores, HT) under Linux 3.19.0-68-generic. The figures are single measurements.

pthread_barrier_t bar;

void* check_dots_advanced(void* _iterations) {
    int* iterations = (int*)_iterations;
    double* res = (double*)malloc(sizeof(double));
    *res = 0.0; // malloc does not zero memory, so initialise the accumulator
    // rand() is called once per thread, under the semaphore, only to pick a seed
    sem_wait(&sem);
    unsigned int seed = rand();
    sem_post(&sem);
    pthread_barrier_wait(&bar); // all threads are ready, timing starts
    for(int i = 0; i < *iterations; ++i) {
        // rand_r() keeps its state in the thread-local seed
        double x = (double)(rand_r(&seed) % 314) / 100;
        double y = (double)(rand_r(&seed) % 100) / 100;
        if(y <= sin(x)) *res += x * y;
    }
    pthread_barrier_wait(&bar); // work is done, timing stops
    pthread_exit((void*)res);
}

double run(int threads_num, bool advanced) {
    sem_init(&sem, 0, 1);
    struct timespec begin, end;
    double elapsed;
    pthread_t threads[threads_num];
    int iters = MAX_DOTS / threads_num;
    // threads_num workers plus the main thread meet at the barrier
    pthread_barrier_init(&bar, NULL, threads_num + 1); // barrier init
    // note: check_dot() needs the same two pthread_barrier_wait() calls,
    // otherwise the non-advanced path never gets past the barrier
    for(int i = 0; i < threads_num; ++i) {
        if(!advanced) pthread_create(&threads[i], NULL, &check_dot, (void*)&iters);
        else pthread_create(&threads[i], NULL, &check_dots_advanced, (void*)&iters);
    }
    pthread_barrier_wait(&bar); // wait until threads are ready
    if(clock_gettime(CLOCK_REALTIME, &begin) == -1) { // begin time
        perror("Unable to get time");
        exit(-1);
    }

    pthread_barrier_wait(&bar); // wait until threads finish
    if(clock_gettime(CLOCK_REALTIME, &end) == -1) { // end time
        perror("Unable to get time");
        exit(-1);
    }
    for(int i = 0; i < threads_num; ++i) {
        if(!advanced) pthread_join(threads[i], NULL);
        else {
            void* tmp;
            pthread_join(threads[i], &tmp);
            sum += *((double*)tmp);
            free(tmp);
        }
    }
    pthread_barrier_destroy(&bar);
    // the rest is unchanged from the run() in the question
    sem_destroy(&sem);
    elapsed = end.tv_sec - begin.tv_sec;
    elapsed += (end.tv_nsec - begin.tv_nsec) / 1000000000.0;
    return elapsed;
}
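
To put the timings quoted in this answer in terms of speedup, here is a small sketch of the usual arithmetic. The helper names speedup(), efficiency() and report() are made up for illustration and are not part of the code above. Note that the accel value in the question is computed as time / worst_time, which is the inverse of the speedup defined here.

```c
#include <stdio.h>

/* Speedup: how many times faster the parallel run is than the serial run. */
double speedup(double t_serial, double t_parallel) {
    return t_serial / t_parallel;
}

/* Parallel efficiency: speedup per thread, where 1.0 would be ideal scaling. */
double efficiency(double t_serial, double t_parallel, int threads) {
    return speedup(t_serial, t_parallel) / threads;
}

/* Print one line of a scaling report for a given thread count. */
void report(int threads, double t_serial, double t_parallel) {
    printf("%d threads: speedup %.2f, efficiency %.2f\n",
           threads,
           speedup(t_serial, t_parallel),
           efficiency(t_serial, t_parallel, threads));
}
```

With the numbers above, report(5, 1.23, 0.55) gives a speedup of about 2.24 for the rand_r() change alone, and report(8, 1.23, 0.18) gives a speedup of about 6.83 (efficiency about 0.85) once thread creation and joining are excluded from the timed region.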

Answer 2

The problem you are experiencing is that the internal state of rand() is a resource shared between all threads, so the threads serialise on access to rand().

You need to use a pseudo-random number generator with per-thread state. The rand_r() function (although marked obsolete in the latest version of POSIX) can be used as such. For serious work you would be best off importing an implementation of a specific PRNG algorithm such as Mersenne Twister.

Answer 1: posted 2016-09-10 13:15:02
Answer 2: posted 2016-09-10 13:23:11
