pthread is slower than the “default” version

Question

SITUATION

I want to see the advantage of using pthreads. If I'm not mistaken, threads let me execute parts of a program in parallel.

Here is what I'm trying to accomplish: I want to write a program that takes a number (say n) and outputs the sum of [0..n].

code

#include <stdio.h>

#define MAX 1000000000

int
main() {
    long long n = 0;
    for (long long i = 1; i < MAX; ++i)
        n += i;

    printf("\nn: %lld\n", n);
    return 0;
}

time: 0m2.723s

To my understanding, I could simply divide MAX by 2 and let 2 threads do the job.

code

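#include <pthread.h>
#include <stdio.h>

#define MAX          1000000000
#define MAX_THREADS  2
#define STRIDE       (MAX / MAX_THREADS)  /* parenthesized so it expands safely inside expressions */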

typedef struct {
    long long off;
    long long res;
} arg_t;

void*
callback(void *args) {
    arg_t *arg = (arg_t*)args;

    for (long long i = arg->off; i < arg->off + STRIDE; ++i)
        arg->res += i;

    pthread_exit(0);
}

int
main() {
    pthread_t threads[MAX_THREADS];
    arg_t     results[MAX_THREADS];

    for (int i = 0; i < MAX_THREADS; ++i) {
        results[i].off = (long long)i * STRIDE;  /* widen first to avoid int overflow */
        results[i].res = 0;

        pthread_create(&threads[i], NULL, callback, (void*)&results[i]);
    }

    for (int i = 0; i < MAX_THREADS; ++i)
        pthread_join(threads[i], NULL);

    long long result;
    result = results[0].res;

    for (int i = 1; i < MAX_THREADS; ++i)
        result += results[i].res;

    printf("\nn: %lld\n", result);

    return 0;
}

time: 0m8.530s

PROBLEM

The version with pthread runs slower. Logically this version should run faster, but maybe creation of threads is more expensive.

Can someone suggest a solution or show what I'm doing/understanding wrong here?

Answer 1

Your problem is cache thrashing (false sharing) combined with a lack of optimization (I bet you're compiling with optimization turned off).

The naive (-O0) code for

for (long long i = arg->off; i < arg->off + STRIDE; ++i)
    arg->res += i;

will access the memory of *arg on every iteration. With your results array defined the way it is, that memory is very close to the memory of the next arg, so the two threads fight over the same cache line, which makes caching very ineffective.

If you compile with -O1, the loop should keep the running sum in a register and write to memory only at the end. Then you should get better performance with threads (higher optimization levels on gcc seem to optimize the loop out completely).
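The same effect can be had by hand, without relying on the optimizer. The following is a minimal sketch, not the asker's code: the names range_t, sum_range and parallel_sum are made up for illustration, and the 64-thread cap is an arbitrary assumption. Each worker accumulates into a local variable and stores to the shared array exactly once, so the threads no longer write to neighbouring memory inside the hot loop.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical names, for illustration only. */
typedef struct {
    long long begin;  /* first value to add (inclusive) */
    long long end;    /* one past the last value to add */
    long long res;    /* partial sum, written once by the worker */
} range_t;

static void *sum_range(void *p) {
    range_t *r = p;
    long long acc = 0;  /* private to this thread: a register or its own stack */

    for (long long i = r->begin; i < r->end; ++i)
        acc += i;

    r->res = acc;       /* the only store into the shared array */
    return NULL;
}

/* Sums the integers in [0, n) with nthreads workers (1..64, an arbitrary cap).
   Returns -1 if the thread count is out of range or a thread cannot be created. */
long long parallel_sum(long long n, int nthreads) {
    pthread_t tid[64];
    range_t   part[64];

    if (nthreads < 1 || nthreads > 64)
        return -1;

    long long chunk = n / nthreads;
    int started = 0;

    for (int t = 0; t < nthreads; ++t) {
        part[t].begin = t * chunk;
        /* the last worker also takes the remainder of the division */
        part[t].end = (t == nthreads - 1) ? n : (t + 1) * chunk;
        part[t].res = 0;
        if (pthread_create(&tid[t], NULL, sum_range, &part[t]) != 0)
            break;
        ++started;
    }

    long long total = 0;
    for (int t = 0; t < started; ++t) {
        pthread_join(tid[t], NULL);
        total += part[t].res;
    }
    return started == nthreads ? total : -1;
}
```

Calling parallel_sum(MAX, MAX_THREADS) would cover the same range as the question's loop plus the zero term. Build with -pthread; at -O0 the accumulator lives on the worker's own stack rather than in a register, which still keeps it off the shared cache line.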

Another (better) option is to align arg_t on a cache line:

typedef struct {
    _Alignas(64) /*typical cache line size*/ long long off;
    long long res;
} arg_t;

Then you should get better performance with threads regardless of whether or not you turn optimization on.
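To convince yourself the alignment does what is intended, a small check like the following may help. It is a sketch under the assumption of a 64-byte cache line (typical on x86-64, but not guaranteed everywhere); padded_arg_t, CACHE_LINE and on_separate_lines are hypothetical names introduced here.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64  /* assumed line size; check your CPU's documentation */

typedef struct {
    _Alignas(CACHE_LINE) long long off;  /* forces the whole struct onto a 64-byte boundary */
    long long res;
} padded_arg_t;

/* The compiler pads the struct up to its alignment, so array elements are one line apart. */
_Static_assert(sizeof(padded_arg_t) == CACHE_LINE,
               "padded_arg_t should occupy exactly one assumed cache line");

/* Returns 1 when the two objects start in different (assumed) cache lines. */
int on_separate_lines(const padded_arg_t *a, const padded_arg_t *b) {
    uintptr_t line_a = (uintptr_t)a / CACHE_LINE;
    uintptr_t line_b = (uintptr_t)b / CACHE_LINE;
    return line_a != line_b;
}
```

With this layout, results[0] and results[1] can no longer land on the same line, at the cost of 48 bytes of padding per element.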

Good cache utilization is generally very important in multithreaded programming (and Ulrich Drepper has much to say on that topic in his infamous "What Every Programmer Should Know About Memory").

Answer 2

Creating a whole bunch of threads is very unlikely to be quicker than simply adding numbers. The CPU can add an awfully large number of integers in the time it takes the kernel to set up and tear down a thread. To see the benefit of multithreading, you really need each thread to be doing a significant task -- significant compared to the overhead in creating the thread, anyway. Alternatively, you need to keep a pool of threads running, and assign them work according to some allocation strategy.
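How large that overhead is depends on the machine, so it may be worth measuring rather than guessing. Here is a rough sketch, assuming a POSIX system with clock_gettime; noop, now_sec and thread_roundtrip_sec are hypothetical helper names. It times the creation and joining of threads that do nothing, which you can then compare with the run time of your own loop.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <time.h>

/* Thread body that does no work, so only create/join cost is measured. */
static void *noop(void *arg) {
    return arg;
}

/* Monotonic wall-clock time in seconds. */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Average seconds needed to create and join one do-nothing thread.
   Returns a negative value if rounds <= 0 or a thread cannot be created. */
double thread_roundtrip_sec(int rounds) {
    if (rounds <= 0)
        return -1.0;

    double start = now_sec();
    for (int i = 0; i < rounds; ++i) {
        pthread_t t;
        if (pthread_create(&t, NULL, noop, NULL) != 0)
            return -1.0;
        pthread_join(t, NULL);
    }
    return (now_sec() - start) / rounds;
}
```

Printing thread_roundtrip_sec(1000) next to the measured time of the summation gives a sense of whether the per-thread cost is significant for a given amount of work.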

Multi-threading works best when an application consists of tasks that are somewhat independent, that would otherwise be waiting on one another to complete. It isn't a magic way to get more throughput.

2 answers

solution1
2 ACCEPTED 2020-09-14 15:06:28

solution2
0 2020-09-14 14:57:47
