
Multithreaded OpenBLAS degrades performance

I have a driver .cpp file that calls the cblas_dgbmv function with proper arguments. When I build OpenBLAS with plain "make", dgbmv automatically runs with 8 threads (the multithreaded dgbmv is invoked in the gbmv.c interface, which I assume is the default behaviour). Conversely, when I set OPENBLAS_NUM_THREADS=1 after this build, the sequential version runs and everything goes well. All good so far.
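For reference, this is how I select the thread count via the environment before launching the driver (a minimal sketch; the binary name and values are just the ones used in this question):

```shell
# OPENBLAS_NUM_THREADS is read by OpenBLAS at startup and selects the
# number of worker threads for its parallel kernels.
export OPENBLAS_NUM_THREADS=1   # run the sequential kernels
export OMP_NUM_THREADS=1        # keep the driver itself single-threaded
# A per-run override looks like:
# OPENBLAS_NUM_THREADS=4 ./myBinary args...
```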

The problem is that I would like to assess the performance of the multithreaded cblas_dgbmv at different thread counts, using a loop that calls this function 1000 times serially and measuring the time. My driver is sequential. However, even 2-threaded dgbmv degrades performance (execution time), and it does so even for a single multithreaded call, without the loop.

I have researched multithreaded runs of OpenBLAS and made sure everything conforms to the documentation. There is no thread spawning or any pragma directive in my driver (it runs only a master thread, which it also uses to measure the wall clock). In other words, I call dgbmv from a sequential region, so it does not conflict with OpenBLAS's threads. Still, it feels as if excess threads are running and slowing down execution, even though I have already set every thread-related environment variable except OPENBLAS_NUM_THREADS to 1.

I use the OpenMP wall-clock timer (omp_get_wtime), and the timing code surrounds only this 1000-times caller loop, so the measurement itself is fine as well:

  double seconds, timing = 0.0;
  // for (int i = 0; i < 10000; i++) {
      seconds = omp_get_wtime();
      cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku, alpha, B, lda, X, incx, beta, Y, incy);
      timing += omp_get_wtime() - seconds;
  // }

I run my driver with the environment variable set at runtime (OPENBLAS_NUM_THREADS=4 ./myBinary args...). Here is my Makefile, which compiles both the library and the application:

myBinary: myBinary.cpp
    cd ./xianyi-OpenBLAS-0b678b1 && make USE_THREAD=1 USE_OPENMP=0 NUM_THREADS=4  &&  make PREFIX=/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1  install
    g++ myBinary.cpp -o myBinary -I/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/include/ -L/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -Wl,-rpath,/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -lopenblas -fopenmp -lstdc++fs -std=c++17

Architecture: 64-core shared-memory machine with AMD Opteron processors.

I would be more than happy if anyone could explain what goes wrong with the multithreaded version of dgbmv.

In my own program, which scales well (different from the multithreaded OpenBLAS case above), I tried setting GOMP_CPU_AFFINITY to 0..8, OMP_PROC_BIND to true, and OMP_PLACES to threads(8), so as to run 8 threads on the first 8 CPUs (or cores) without hyperthreading. I then checked visually via the htop utility that every thread was executing on the first NUMA node with 8 processors. After ensuring that, the result was 5 seconds slower; by unsetting these variables I got a result 5 seconds faster. @JérômeRichard I'll try the same thing for the OpenBLAS driver as well.
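Concretely, the affinity setup tried in this comment looks like the following (standard libgomp/OpenMP variable names; 0..8 meaning the first 8 logical CPUs is written as the range 0-7):

```shell
export GOMP_CPU_AFFINITY="0-7"   # pin threads to the first 8 logical CPUs
export OMP_PROC_BIND=true        # keep threads from migrating between CPUs
export OMP_PLACES="threads(8)"   # 8 places, one per hardware thread
```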

I have just tried for OpenBLAS what I wrote in the other comment (the settings for my own OpenMP program). I built the library with make USE_OPENMP=1 (as I said, the driver itself is sequential anyway) and NUM_THREADS=256 to set a maximum. After running OpenBLAS multithreaded, htop displays multiple threads running on the same NUMA node (e.g. the first 8 cores), which I arranged using the environment variables OMP_PROC_BIND=true and OMP_PLACES with hardware threads. However, even one call to multithreaded dgbmv is slower than the sequential (1-thread) version.

Besides, on my system the multithreaded OpenBLAS threads keep alternating between sleeping and running (whereas in my own OpenMP parallel program all threads are always in the running state), and their CPU utilization is low, around 60%.

[screenshot of htop]
