
multi-process MPI vs. multithreaded std::thread performance

I wrote a simple test program to compare the performance of parallelizing over multiple processes using MPI versus over multiple threads with std::thread. The work that is being parallelized is simply writing into a large array. What I'm seeing is that multi-process MPI outperforms multithreading by quite a wide margin.

The test code is:

#ifdef USE_MPI
#include <mpi.h>
#else
#include <thread>
#endif
#include <cstdlib>   // for atoi
#include <iostream>
#include <vector>

// Each worker allocates and fills a vector of 10^9 ints (about 4 GB per worker).
void dowork(int i){
    int n = 1000000000;
    std::vector<int> foo(n, -1);
}

int main(int argc, char *argv[]){
    int npar = 1;
#ifdef USE_MPI
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npar);
#else
    npar = 8;
    if(argc > 1){
        npar = atoi(argv[1]);
    }
#endif
    std::cout << "npar = " << npar << std::endl;

    int i;

#ifdef USE_MPI
    MPI_Comm_rank(MPI_COMM_WORLD, &i);
    dowork(i);
    MPI_Finalize();
#else
    std::vector<std::thread> threads;
    for(i = 0; i < npar; ++i){
        threads.emplace_back([i](){
            dowork(i);
        });
    }
    for(i = 0; i < npar; ++i){
        threads[i].join();
    }
#endif
    return 0;
}

The Makefile is:

partest_mpi:
    mpic++ -O2 -DUSE_MPI  partest.cpp -o partest_mpi -lmpi
partest_threads:
    c++ -O2 partest.cpp -o partest_threads -lpthread

And the results of execution are:

$ time ./partest_threads 8
npar = 8

real    0m2.524s
user    0m4.691s
sys 0m9.330s

$ time mpirun -np 8 ./partest_mpi
npar = 8
npar = 8
npar = 8
npar = 8
npar = 8
npar = 8
npar = 8npar = 8


real    0m1.811s
user    0m4.817s
sys 0m9.011s

So the question is, why is this happening, and what can I do on the threaded code to make it perform better? I'm guessing this has something to do with memory bandwidth and cache utilization. I'm running this on an Intel i9-9820X 10-core CPU.

TL;DR: make sure you have enough RAM and that your benchmarking method is accurate. That being said, I am not able to reproduce such a difference on my machine (i.e. I get practically identical performance results).

On most platforms, your code allocates about 30 GiB in total: sizeof(int) is 4, each of the 8 processes/threads allocates its own vector of 10^9 ints, and the items are value-initialized by the vector, i.e. 8 × 4 GB. Thus, you should first ensure you have at least that much RAM. Otherwise data may be written to a (much slower) storage device (e.g. SSD/HDD) due to memory swapping. Benchmarks are not really useful in such an extreme case (especially because the results will likely be unstable).

Assuming you have enough RAM, your application is mostly bound by page faults. Indeed, on most modern mainstream platforms the operating system (OS) allocates virtual memory very quickly, but it does not map it to physical memory immediately. The mapping is done when a page is read or written for the first time (i.e. on a page fault), and it is known to be slow. Moreover, for security reasons (e.g. not to leak data of other processes), most OSes zero each page the first time it is written, making page faults even slower. On some systems this may not scale well (although it should be fine on typical desktop machines running Windows/Linux/Mac). This part of the time is reported as system time.
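
To make this first-touch cost visible, here is a minimal sketch (not part of the original benchmark) that writes a freshly allocated buffer twice: the first pass triggers the page faults and kernel page zeroing described above, while the second pass reuses the already-mapped pages and is typically several times faster.

// first_touch.cpp -- minimal sketch: compare the first write pass over freshly
// allocated memory (page faults + kernel zeroing) with a second pass over the
// same, already-mapped pages.
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iostream>

int main(){
    const std::size_t n = 1ull << 30;                    // 1 GiB
    char *buf = static_cast<char*>(std::malloc(n));      // virtual memory only for now
    if(!buf) return 1;

    auto t0 = std::chrono::steady_clock::now();
    std::memset(buf, 1, n);                              // first touch: page faults happen here
    auto t1 = std::chrono::steady_clock::now();
    std::memset(buf, 2, n);                              // pages are already mapped
    auto t2 = std::chrono::steady_clock::now();

    std::cout << "first pass:  " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "second pass: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
    std::free(buf);
    return 0;
}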

The rest of the time is mainly the time required to fill the vector in RAM. This part barely scales on many platforms: generally 2-3 cores are enough to saturate the RAM bandwidth of a desktop machine.
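
A rough way to check this on a given machine is to measure the fill bandwidth for a few thread counts, with the buffers pre-touched so that page faults are excluded from the timing. The sketch below is hypothetical (not from the question); on typical desktop hardware the reported bandwidth usually plateaus after a few threads.

// fill_bw.cpp -- rough sketch: memset fill bandwidth for 1..4 threads.
// Each buffer is value-initialized up front so page faults are not measured.
#include <chrono>
#include <cstring>
#include <iostream>
#include <thread>
#include <vector>

int main(){
    const std::size_t per_thread = 1ull << 29;   // 512 MiB per thread
    for(int k = 1; k <= 4; ++k){
        // value-initialization pre-faults every page of every buffer
        std::vector<std::vector<char>> bufs(k, std::vector<char>(per_thread, 0));
        std::vector<std::thread> threads;
        auto t0 = std::chrono::steady_clock::now();
        for(int t = 0; t < k; ++t)
            threads.emplace_back([&bufs, t, per_thread]{
                std::memset(bufs[t].data(), 1, per_thread);   // pure fill, no page faults
            });
        for(auto &th : threads) th.join();
        double s = std::chrono::duration<double>(std::chrono::steady_clock::now() - t0).count();
        std::cout << k << " thread(s): " << (double(k) * per_thread / s) / 1e9 << " GB/s\n";
    }
    return 0;
}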

That being said, on my machine I am unable to reproduce the same outcome with 10 times less memory allocated (as I do not have 30 GiB of RAM). The same applies with 4 times less memory. Actually, the MPI version is much slower on my Linux machine with an i7-9600KF. Note that the results are relatively stable and reproducible (regardless of the ordering and the number of runs):

time ./partest_threads 6 > /dev/null
real    0m0,188s
user    0m0,204s
sys 0m0,859s

time mpirun -np 6 ./partest_mpi > /dev/null
real    0m0,567s
user    0m0,365s
sys 0m0,991s

The bad result of the MPI version comes from the slow initialization of the MPI runtime on my machine: a program that does nothing takes roughly 350 ms just to initialize. This shows the behaviour is platform-dependent. At the very least, it shows that time should not be used to measure the performance of the two applications; one should instead use monotonic C++ clocks.
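
For reference, here is a sketch of the kind of instrumentation meant here (assumed, not the exact code used for the numbers below): only the dowork() call is measured with std::chrono::steady_clock, and the MPI build places barriers around it so that MPI_Init/MPI_Finalize and rank skew are excluded from the measurement.

// partest_timed.cpp -- sketch of the timing approach (assumed, not the exact code
// behind the numbers below): measure only the work with a monotonic clock, and in
// the MPI build synchronize the ranks so MPI startup/shutdown is excluded.
#ifdef USE_MPI
#include <mpi.h>
#else
#include <thread>
#endif
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>

void dowork(int i){
    int n = 1000000000;
    std::vector<int> foo(n, -1);
}

int main(int argc, char *argv[]){
#ifdef USE_MPI
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);                 // start all ranks together
    auto t0 = std::chrono::steady_clock::now();
    dowork(rank);
    MPI_Barrier(MPI_COMM_WORLD);                 // wait for the slowest rank
    auto t1 = std::chrono::steady_clock::now();
    if(rank == 0)
        std::cout << "Time: " << std::chrono::duration<double>(t1 - t0).count() << " s\n";
    MPI_Finalize();
#else
    int npar = argc > 1 ? std::atoi(argv[1]) : 8;
    std::vector<std::thread> threads;
    auto t0 = std::chrono::steady_clock::now();
    for(int i = 0; i < npar; ++i)
        threads.emplace_back([i]{ dowork(i); });
    for(auto &t : threads)
        t.join();
    auto t1 = std::chrono::steady_clock::now();
    std::cout << "Time: " << std::chrono::duration<double>(t1 - t0).count() << " s\n";
#endif
    return 0;
}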

Once the code has been fixed to use an accurate timing method (with C++ clocks and MPI barriers), I get very close performance results between the two implementations (10 runs, with sorted timings):

pthreads:
Time: 0.182812 s
Time: 0.186766 s
Time: 0.187641 s
Time: 0.18785 s
Time: 0.18797 s
Time: 0.188256 s
Time: 0.18879 s
Time: 0.189314 s
Time: 0.189438 s
Time: 0.189501 s
Median time: 0.188 s

mpirun:
Time: 0.185664 s
Time: 0.185946 s
Time: 0.187384 s
Time: 0.187696 s
Time: 0.188034 s
Time: 0.188178 s
Time: 0.188201 s
Time: 0.188396 s
Time: 0.188607 s
Time: 0.189208 s
Median time: 0.188 s

For a deeper analysis on Linux, you can use the perf tool. A kernel-side profile shows that most of the time (60-80%) is spent in the kernel function clear_page_erms, which zeroes pages during page faults (as described above), followed by __memset_avx2_erms, which fills the vector values. Other functions take only a tiny fraction of the overall run time. Here is an example with pthreads:

  64,24%  partest_threads  [kernel.kallsyms]              [k] clear_page_erms
  18,80%  partest_threads  libc-2.31.so                   [.] __memset_avx2_erms
   2,07%  partest_threads  [kernel.kallsyms]              [k] prep_compound_page
   0,86%  :8444            [kernel.kallsyms]              [k] clear_page_erms
   0,82%  :8443            [kernel.kallsyms]              [k] clear_page_erms
   0,74%  :8445            [kernel.kallsyms]              [k] clear_page_erms
   0,73%  :8446            [kernel.kallsyms]              [k] clear_page_erms
   0,70%  :8442            [kernel.kallsyms]              [k] clear_page_erms
   0,69%  :8441            [kernel.kallsyms]              [k] clear_page_erms
   0,68%  partest_threads  [kernel.kallsyms]              [k] kernel_init_free_pages
   0,66%  partest_threads  [kernel.kallsyms]              [k] clear_subpage
   0,62%  partest_threads  [kernel.kallsyms]              [k] get_page_from_freelist
   0,41%  partest_threads  [kernel.kallsyms]              [k] __free_pages_ok
   0,37%  partest_threads  [kernel.kallsyms]              [k] _cond_resched
[...]  

If there were any intrinsic performance overhead in one of the two implementations, perf should be able to report it. If you are running on Windows, you can use another profiling tool such as VTune.
