pthread version is slower than the "default" version

Question

SITUATION

I want to see the advantage of using pthread. If I'm not wrong, threads allow me to execute given parts of a program in parallel.

So here is what I am trying to accomplish: I want to make a program that takes a number (let's say n) and outputs the sum of [0..n].

code

#include <stdio.h>

#define MAX 1000000000

int
main(void) {
    long long n = 0;
    /* adds 1 .. MAX-1 */
    for (long long i = 1; i < MAX; ++i)
        n += i;

    printf("\nn: %lld\n", n);
    return 0;
}

time: 0m2.723s

To my understanding, I could simply take that number MAX, divide it by 2, and let 2 threads do the job.

code

#include <pthread.h>
#include <stdio.h>

#define MAX          1000000000
#define MAX_THREADS  2
#define STRIDE       (MAX / MAX_THREADS)  /* parenthesized so it expands safely inside expressions */

typedef struct {
    long long off;  /* first value this thread adds */
    long long res;  /* partial sum of this thread */
} arg_t;

void*
callback(void *args) {
    arg_t *arg = (arg_t*)args;

    /* adds off .. off+STRIDE-1 into arg->res */
    for (long long i = arg->off; i < arg->off + STRIDE; ++i)
        arg->res += i;

    pthread_exit(0);
}

int
main(void) {
    pthread_t threads[MAX_THREADS];
    arg_t     results[MAX_THREADS];

    for (int i = 0; i < MAX_THREADS; ++i) {
        results[i].off = i * STRIDE;
        results[i].res = 0;

        pthread_create(&threads[i], NULL, callback, (void*)&results[i]);
    }

    /* wait for all threads before reading their partial sums */
    for (int i = 0; i < MAX_THREADS; ++i)
        pthread_join(threads[i], NULL);

    long long result;
    result = results[0].res;

    for (int i = 1; i < MAX_THREADS; ++i)
        result += results[i].res;

    printf("\nn: %lld\n", result);

    return 0;
}

time: 0m8.530s

PROBLEM

The version with pthread runs slower. Logically this version should run faster, but maybe creating the threads is more expensive.

Can someone suggest a solution or show what I'm doing or understanding wrong here?

Answer 1

Your problem is cache thrashing (false sharing, in this case) combined with a lack of optimization (I bet you're compiling with optimization off).

The unoptimized (-O0) code for

for (long long i = arg->off; i < arg->off + STRIDE; ++i)
    arg->res += i;

will access the memory of *arg on every iteration. With your results array defined the way it is, that memory sits right next to the memory of the next arg, so the two threads fight over the same cache line, which makes caching very ineffective.

If you compile with -O1, the loop should use a register instead and only write to memory at the end. Then you should get better performance with threads (higher optimization levels on gcc seem to optimize the loop out completely).

Another (better) option is to align arg_t on a cache line:

typedef struct {
    _Alignas(64) /*typical cache line size*/ long long off;
    long long res;
} arg_t;

Then you should get better performance with threads regardless of whether you turn optimization on.

Good cache utilization is generally very important in multithreaded programming (and Ulrich Drepper has much to say on that topic in his well-known What Every Programmer Should Know About Memory).

Answer 2

Creating a whole bunch of threads is very unlikely to be quicker than simply adding numbers. The CPU can add an awfully large number of integers in the time it takes the kernel to set up and tear down a thread. To see the benefit of multithreading, you really need each thread to be doing a significant task, significant compared to the overhead of creating the thread, anyway. Alternatively, you need to keep a pool of threads running and assign them work according to some allocation strategy.

Multi-threading works best when an application consists of tasks that are somewhat independent and that would otherwise be waiting on one another to complete. It isn't a magic way to get more throughput.


Answer 1 (score 2, accepted): 2020-09-14 15:06:28

Answer 2 (score 0): 2020-09-14 14:57:47
