Multithreaded random_r is slower than the single-threaded version

Question

The following program is essentially the same as the one described here. When I compile and run the program using two threads (NTHREADS == 2), I get the following run times:

real        0m14.120s
user        0m25.570s
sys         0m0.050s

When it is run with just one thread (NTHREADS == 1), the run times are significantly better, even though it is only using one core.

real        0m4.705s
user        0m4.660s
sys         0m0.010s

My system is dual core. I know random_r is thread safe, and I am pretty sure it is non-blocking. When the same program is run without random_r, using a calculation of cosines and sines as a replacement, the dual-threaded version runs in about half the time, as expected.

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define NTHREADS 2
#define PRNG_BUFSZ 8
#define ITERATIONS 1000000000

void* thread_run(void* arg) {
    int r1, i, totalIterations = ITERATIONS / NTHREADS;
    for (i = 0; i < totalIterations; i++){
        random_r((struct random_data*)arg, &r1);
    }
    printf("%i\n", r1);
    return NULL; /* a void* thread function must return a value */
}

int main(int argc, char** argv) {
    struct random_data* rand_states = (struct random_data*)calloc(NTHREADS, sizeof(struct random_data));
    char* rand_statebufs = (char*)calloc(NTHREADS, PRNG_BUFSZ);
    pthread_t* thread_ids;
    int t = 0;
    thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
    /* create threads */
    for (t = 0; t < NTHREADS; t++) {
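        /* each thread needs its own PRNG_BUFSZ-byte slice; indexing by t alone makes the state buffers overlap */
        initstate_r(random(), &rand_statebufs[t * PRNG_BUFSZ], PRNG_BUFSZ, &rand_states[t]);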
        pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t]);
    }
    for (t = 0; t < NTHREADS; t++) {
        pthread_join(thread_ids[t], NULL);
    }
    free(thread_ids);
    free(rand_states);
    free(rand_statebufs);
}

I am confused about why the two-threaded version performs so much worse than the single-threaded version when generating random numbers, considering that random_r is meant to be used in multi-threaded applications.

Answer 1

A very simple change to space the data out in memory:

struct random_data* rand_states = (struct random_data*)calloc(NTHREADS * 64, sizeof(struct random_data));
char* rand_statebufs = (char*)calloc(NTHREADS*64, PRNG_BUFSZ);
pthread_t* thread_ids;
int t = 0;
thread_ids = (pthread_t*)calloc(NTHREADS, sizeof(pthread_t));
/* create threads */
for (t = 0; t < NTHREADS; t++) {
    initstate_r(random(), &rand_statebufs[t*64], PRNG_BUFSZ, &rand_states[t*64]);
    pthread_create(&thread_ids[t], NULL, &thread_run, &rand_states[t*64]);
}

results in a much faster running time on my dual-core machine.

This confirms the suspicion it was meant to test: you are mutating values on the same cache line in two separate threads, and so have cache contention (false sharing). If you don't know about that yet, Herb Sutter's talk 'Machine Architecture: Things Your Programming Language Never Told You' is worth watching if you have the time; he demonstrates false sharing starting at around 1:20.
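
To make the "same cache line" claim concrete, here is a minimal sketch of a check for whether two byte ranges touch a common cache line. The names line_index, ranges_share_line and ASSUMED_LINE_SIZE are made up for this illustration, and the 64-byte line size is an assumption rather than something measured on the asker's machine.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines (common on x86, not universal). */
#define ASSUMED_LINE_SIZE 64

/* Index of the cache line that contains address p. */
uintptr_t line_index(const void *p) {
    return (uintptr_t)p / ASSUMED_LINE_SIZE;
}

/* Returns 1 if [a, a+alen) and [b, b+blen) touch at least one common
   cache line, 0 otherwise. Both lengths must be nonzero. */
int ranges_share_line(const void *a, size_t alen, const void *b, size_t blen) {
    uintptr_t a_first = line_index(a);
    uintptr_t a_last  = line_index((const char *)a + alen - 1);
    uintptr_t b_first = line_index(b);
    uintptr_t b_last  = line_index((const char *)b + blen - 1);
    return a_first <= b_last && b_first <= a_last;
}
```

In the question's program the per-thread state buffers are only PRNG_BUFSZ (8) bytes each and live in one small allocation, so a check like this would be expected to report them as sharing a line. With the t*64 spacing above, each thread's buffer starts 64 bytes after the previous one, so they cannot.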

Work out your cache line size, and create each thread's data so it is aligned to it.
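
One way to work out the line size at run time, sketched under the assumption of a Linux/glibc system: glibc exposes _SC_LEVEL1_DCACHE_LINESIZE as a sysconf extension (it is not part of POSIX, and it can report 0 or -1 where the value is unknown). The function name cache_line_size and the 64-byte fallback are choices made for this example.

```c
#include <stddef.h>
#include <unistd.h>

/* Assumption: fall back to 64 bytes when the system cannot tell us. */
#define FALLBACK_LINE_SIZE 64

/* Returns the L1 data cache line size in bytes, or the fallback. */
size_t cache_line_size(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE); /* glibc extension */
    if (n > 0)
        return (size_t)n;
#endif
    return FALLBACK_LINE_SIZE;
}
```

Because struct padding has to be a compile-time constant, a run-time value like this is mostly useful for checking that a hard-coded CACHE_LINE_SIZE matches the machine, or for choosing the alignment passed to posix_memalign.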

It's a bit cleaner to plonk all of a thread's data into a struct, and align that:

#include <string.h> /* memset; the other headers and thread_run are the same as in the question */

#define CACHE_LINE_SIZE 64

struct thread_data {
    struct random_data random_data;
    char statebuf[PRNG_BUFSZ];
    char padding[CACHE_LINE_SIZE - sizeof ( struct random_data )-PRNG_BUFSZ];
};

int main ( int argc, char** argv )
{
    printf ( "%zu\n", sizeof ( struct thread_data ) ); /* %zu is the size_t conversion */

    void* apointer;

    if ( posix_memalign ( &apointer, sizeof ( struct thread_data ), NTHREADS * sizeof ( struct thread_data ) ) )
        exit ( 1 );

    struct thread_data* thread_states = apointer;

    memset ( apointer, 0, NTHREADS * sizeof ( struct thread_data ) );

    pthread_t* thread_ids;

    int t = 0;

    thread_ids = ( pthread_t* ) calloc ( NTHREADS, sizeof ( pthread_t ) );

    /* create threads */
    for ( t = 0; t < NTHREADS; t++ ) {
        initstate_r ( random(), thread_states[t].statebuf, PRNG_BUFSZ, &thread_states[t].random_data );
        pthread_create ( &thread_ids[t], NULL, &thread_run, &thread_states[t].random_data );
    }

    for ( t = 0; t < NTHREADS; t++ ) {
        pthread_join ( thread_ids[t], NULL );
    }

    free ( thread_ids );
    free ( thread_states );
}

With CACHE_LINE_SIZE 64:

refugio:$ gcc -O3 -o bin/nixuz_random_r src/nixuz_random_r.c -lpthread
refugio:$ time bin/nixuz_random_r 
64
63499495
944240966

real    0m1.278s
user    0m2.540s
sys 0m0.000s

Or you can use double the cache line size and use malloc. The extra padding ensures the mutated memory is on separate lines, as malloc is 16-byte (IIRC) rather than 64-byte aligned.

(I reduced ITERATIONS by a factor of ten rather than having a stupidly fast machine.)
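
The malloc variant with doubled padding mentioned above could look roughly like this. It is a sketch, not the code that was timed: the names padded_thread_data and alloc_thread_data are invented here, CACHE_LINE_SIZE is assumed to be 64, and the _DEFAULT_SOURCE macro is only there so that glibc declares struct random_data under strict -std=c11 style flags.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1 /* glibc: expose struct random_data */
#endif
#include <stdlib.h>

#define CACHE_LINE_SIZE 64 /* assumed, as in the answer */
#define PRNG_BUFSZ 8       /* as in the question */

/* Each element is padded out to two cache lines. The mutated members sit
   at the front, so at least one full line of padding separates one
   thread's data from the next, whatever the allocation's alignment. */
struct padded_thread_data {
    struct random_data random_data;
    char statebuf[PRNG_BUFSZ];
    char padding[2 * CACHE_LINE_SIZE - sizeof(struct random_data) - PRNG_BUFSZ];
};

/* Zeroed array of nthreads elements; release it with free(). */
struct padded_thread_data *alloc_thread_data(size_t nthreads) {
    return calloc(nthreads, sizeof(struct padded_thread_data));
}
```

The threads would then be set up as in the posix_memalign version, passing states[t].statebuf and &states[t].random_data to initstate_r. The cost is twice the memory per thread in exchange for not depending on the allocator's alignment.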

Answer 2

I don't know if this is relevant or not, but I just saw very similar behavior (an order of magnitude slower with 2 threads than with one). I basically changed:

  srand(seed);
  foo = rand();

to:

  myseed = seed;
  foo = rand_r(&myseed);

and that "fixed" it (2 threads is now reliably almost twice as fast, e.g. 19s instead of 35s).

I don't know what the issue could have been; locking or cache coherence on the internals of rand(), maybe? Anyway, there is also a random_r(), so maybe that would be of use to you (a year ago) or someone else.
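
A minimal sketch of the rand_r pattern described above, where each thread keeps its generator state in a local variable instead of relying on the hidden state behind rand(). The function name sum_random is hypothetical. As far as I know, glibc's rand() guards its single shared state with an internal lock, which would fit the slowdown seen here, but treat that as an assumption. Note also that rand_r is marked obsolescent in POSIX.1-2008.

```c
#define _POSIX_C_SOURCE 200809L /* for rand_r under strict -std flags */
#include <stdlib.h>

/* Sums n pseudo-random values drawn from caller-owned state. The state
   lives on the calling thread's stack, so no other thread touches it. */
unsigned long sum_random(unsigned int seed, long n) {
    unsigned int state = seed; /* private copy, updated by rand_r */
    unsigned long sum = 0;
    for (long i = 0; i < n; i++)
        sum += (unsigned long)rand_r(&state);
    return sum;
}
```

A thread function would call this with its own seed, for example one derived from the thread index, so that threads neither share state nor produce identical sequences.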

Answer 1: score 13, accepted, 2010-06-08 19:30:44

Answer 2: score 1, 2012-04-15 01:28:51
