Lower than expected speedup when using multithreading

Question

Remark: I feel a little bit stupid about this, but this might help someone.

So, I am trying to improve the performance of a program by using parallelism. However, I am encountering an issue with the measured speedup. I have 4 CPUs:

~% lscpu
...
CPU(s):                4
...

However, the speedup is much lower than fourfold. Here is a minimal working example, with a sequential version, a version using OpenMP and a version using POSIX threads (to be sure it is not due to either implementation).

Purely sequential (add_seq.c):

#include <stddef.h>

int main() {
    for (size_t i = 0; i < (1ull<<36); i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return 0;
}

OpenMP (add_omp.c):

#include <stddef.h>

int main() {
    #pragma omp parallel for schedule(static)
    for (size_t i = 0; i < (1ull<<36); i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return 0;
}

POSIX threads (add_pthread.c):

#include <pthread.h>
#include <stddef.h>

void* f(void* x) {
    (void) x;
    const size_t count = (1ull<<36) / 4;
    for (size_t i = 0; i < count; i += 1) {
        __asm__("add $0x42, %%eax" : : : "eax");
    }
    return NULL;
}
int main() {
    pthread_t t[4];
    for (size_t i = 0; i < 4; i += 1) {
        pthread_create(&t[i], NULL, f, NULL);
    }
    for (size_t i = 0; i < 4; i += 1) {
        pthread_join(t[i], NULL);
    }
    return 0;
}

Makefile:

CFLAGS := -O3 -fopenmp
LDFLAGS := -O3 -lpthread  # just to be sure

all: add_seq add_omp add_pthread

So, now, running this (using zsh's time builtin):

% make -B && time ./add_seq && time ./add_omp && time ./add_pthread
cc -O3 -fopenmp  -O3 -lpthread    add_seq.c   -o add_seq
cc -O3 -fopenmp  -O3 -lpthread    add_omp.c   -o add_omp
cc -O3 -fopenmp  -O3 -lpthread    add_pthread.c   -o add_pthread
./add_seq  24.49s user 0.00s system 99% cpu 24.494 total
./add_omp  52.97s user 0.00s system 398% cpu 13.279 total
./add_pthread  52.92s user 0.00s system 398% cpu 13.266 total

Checking CPU frequency: the sequential code runs at the maximum CPU frequency of 2.90 GHz, while the parallel code (all versions) runs at a uniform 2.60 GHz. So, counting billions of clock cycles (wall-clock seconds multiplied by frequency in GHz):

>>> 24.494 * 2.9
71.0326
>>> 13.279 * 2.6
34.5254
>>> 13.266 * 2.6
34.4916

So, all in all, threaded code is only running twice as fast as sequential code, although it reports roughly four times the CPU usage (398% versus 99%). Why is it so?

Remark: the assembly for add_omp.c seemed less efficient, since it implemented the for-loop by incrementing a register and comparing it to the number of iterations, rather than decrementing and directly checking ZF; however, this had no effect on performance.

Answer 1

Well, the answer is quite simple: there are really only two CPU cores:

% lscpu
...
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
...

So, although htop shows four CPUs, two are virtual and only there because of hyper-threading. Since the core idea of hyper-threading is to share the resources of a single core between two threads, it does not help run similar code faster (it is only useful when the two threads use different resources).

So, in the end, what happens is that time/clock() measures the usage of each logical core as if it were that of the underlying physical core. Since all four report ~100% usage, we get ~400% usage, although it only represents a twofold speedup.

Up until then, I was convinced this computer contained 4 physical cores, and had completely forgotten to check about hyperthreading.

Answer 1: score 3, accepted, 2016-09-16 03:43:43

使用多线程时低于预期的加速

问题描述

1 个解决方案

解决方案1 3 已采纳 2016-09-16 03:43:43

解决方案1
3 已采纳 2016-09-16 03:43:43