
Multithreaded OpenBLAS degrades performance

I have a driver .cpp file that calls the cblas_dgbmv function with proper arguments. When I build OpenBLAS with "make", dgbmv automatically runs with 8 threads (the multithreaded dgbmv is invoked in the gbmv.c interface, and I assume this is the default behaviour). By contrast, when I set OPENBLAS_NUM_THREADS=1 after this build, the sequential version runs and everything goes well. All good so far.
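
To double-check which mode was actually built and how many threads the library will use, I print the runtime settings with the openblas_* helpers declared in OpenBLAS's cblas.h (a minimal sketch, assuming these helpers are exported by this build):

    #include <cstdio>
    #include <cblas.h>   // OpenBLAS header, also declares the openblas_* helpers

    int main() {
        printf("config        : %s\n", openblas_get_config());
        printf("parallel mode : %d\n", openblas_get_parallel());   // 0 = sequential, 1 = pthreads, 2 = OpenMP
        printf("num threads   : %d\n", openblas_get_num_threads());
        return 0;
    }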

The problem is that I would like to assess the performance of the multithreaded cblas_dgbmv for different thread counts, using a loop that calls this function 1000 times serially and measuring the total time. My driver is sequential. However, even the 2-threaded dgbmv degrades performance (execution time), even for a single multithreaded call without the loop.

I researched multithreaded runs of OpenBLAS and made sure everything conforms to the specifications. There is no thread spawning or any pragma directive in my driver (it solely runs a master thread, just to measure wall-clock time). In other words, I call dgbmv in a sequential region, so it does not conflict with OpenBLAS's threads. Still, it feels as if excessive threads are running and the execution therefore slows down, although I have already set all environment variables regarding the number of threads (except OPENBLAS_NUM_THREADS) to 1.

I use the OpenMP wall-clock timer and measure the execution time with code surrounding only this 1000-times caller loop, so that should be fine as well:

    double seconds, timing = 0.0;
    // for (int i = 0; i < 10000; i++) {   // loop disabled here: timing a single call
        seconds = omp_get_wtime();
        cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku, alpha, B, lda, X, incx, beta, Y, incy);
        timing += omp_get_wtime() - seconds;
    // }
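
One thing I am not sure about (my assumption, not something I verified in the OpenBLAS sources) is whether the worker threads are created lazily on the first threaded call, in which case timing a single call would mostly measure thread start-up. A sketch of the same measurement with one untimed warm-up call and the loop re-enabled, reusing the variables from the snippet above:

    // Hypothetical variant: untimed warm-up call, then time the whole loop.
    cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku, alpha, B, lda, X, incx, beta, Y, incy); // warm-up
    double start = omp_get_wtime();
    for (int i = 0; i < 1000; i++)
        cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku, alpha, B, lda, X, incx, beta, Y, incy);
    double total = omp_get_wtime() - start;   // average per call: total / 1000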

I run my driver code with the proper environment variable set at runtime (OPENBLAS_NUM_THREADS=4 ./myBinary args...). Here is my Makefile to compile both the library and the application:

myBinary: myBinary.cpp
    cd ./xianyi-OpenBLAS-0b678b1 && make USE_THREAD=1 USE_OPENMP=0 NUM_THREADS=4  &&  make PREFIX=/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1  install
    g++ myBinary.cpp -o myBinary -I/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/include/ -L/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -Wl,-rpath,/home/selin/HPC-Research/xianyi-OpenBLAS-0b678b1/lib -lopenblas -fopenmp -lstdc++fs -std=c++17
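
To compare thread counts inside one run instead of restarting with different OPENBLAS_NUM_THREADS values, a self-contained sketch (assuming openblas_set_num_threads from cblas.h; the sizes below are made up for illustration, not my real problem sizes):

    #include <cstdio>
    #include <vector>
    #include <omp.h>
    #include <cblas.h>

    int main() {
        // Hypothetical band-matrix sizes, chosen only to make the example runnable.
        const int n = 2000, kl = 100, ku = 100, lda = kl + ku + 1;
        std::vector<double> B(static_cast<size_t>(lda) * n, 1.0), X(n, 1.0), Y(n, 0.0);

        for (int t : {1, 2, 4}) {
            openblas_set_num_threads(t);   // overrides OPENBLAS_NUM_THREADS at runtime
            // warm-up call so thread start-up is not included in the timing
            cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku, 1.0,
                        B.data(), lda, X.data(), 1, 0.0, Y.data(), 1);
            double start = omp_get_wtime();
            for (int i = 0; i < 1000; i++)
                cblas_dgbmv(CblasColMajor, CblasNoTrans, n, n, kl, ku, 1.0,
                            B.data(), lda, X.data(), 1, 0.0, Y.data(), 1);
            printf("%d thread(s): %f s for 1000 calls\n", t, omp_get_wtime() - start);
        }
        return 0;
    }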

Architecture: 64-core shared-memory machine with AMD Opteron processors

I would be more than happy if anyone could explain what goes wrong with the multithreaded version of dgbmv.

In my own program that scales well (different from the multithreaded OpenBLAS case mentioned above), I tried setting GOMP_CPU_AFFINITY to 0..8 and OMP_PROC_BIND to true, and also OMP_PLACES to threads(8), so that the 8 threads run on the first 8 CPUs (or cores) with no hyperthreading. I then checked visually via the htop utility that every thread was executing on the first NUMA node with 8 processors. After ensuring that, the result was 5 seconds slower; by unsetting these variables, I got a result 5 seconds faster. @JérômeRichard I'll try the same thing for the OpenBLAS driver as well.
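
As a cross-check of what htop shows, a small sketch independent of OpenBLAS (assuming glibc's sched_getcpu; compile with g++ -fopenmp) that prints which CPU each OpenMP thread lands on under the affinity settings above:

    #include <cstdio>
    #include <sched.h>   // sched_getcpu (glibc / Linux)
    #include <omp.h>

    int main() {
        #pragma omp parallel
        {
            #pragma omp critical
            printf("OpenMP thread %d of %d is on CPU %d\n",
                   omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
        }
        return 0;
    }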

I have just tried what I described in the other comment (the settings for my own OpenMP program) for OpenBLAS. I built the library with make USE_OPENMP=1 (as I stated, it is a sequential driver anyway) and NUM_THREADS=256 to set a maximum. After running OpenBLAS multithreaded, htop displays multiple threads running on the same NUMA node (e.g. the first 8 cores), which I arranged using the environment variables OMP_PROC_BIND=true and OMP_PLACES=threads. However, even a single call to the multithreaded dgbmv is slower than the sequential (1-thread) version.

Besides, on my system the multithreaded OpenBLAS threads keep alternating between sleeping and running (whereas in my own OpenMP parallel program all threads are always in the running state), and their CPU utilization is low, somewhere around 60%.

[screenshot of htop]
