Why is `total_num_virtual_procs` not equal to the amount of MPI processes?

In the NEST simulator there is the concept of virtual processes. Reading the information on virtual processes, I would expect every MPI process to contain at least one virtual process; otherwise that MPI process wouldn't be doing anything?

However, when I start 4 MPI processes, the kernel status attribute `total_num_virtual_procs` is 1:

mpiexec -n 4 python -c "import nest; import mpi4py.MPI; print(nest.GetKernelStatus()['total_num_virtual_procs'], mpi4py.MPI.COMM_WORLD.Get_size());"

This prints the text NEST emits on import, followed by `1 4` four times. Does this mean three processes aren't going to be used for the simulation until I call nest.SetKernelStatus({'total_num_virtual_procs': 4})?

EDIT: TL;DR: The return value of nest.GetKernelStatus('total_num_virtual_procs') was buggy in older NEST versions. Recent versions show the correct number, which by default is one thread per process, i.e. equal to the number of MPI processes.

The number of virtual processes is a free parameter of NEST because it uses a hybrid parallelization scheme (MPI + OpenMP). You may have multiple threads per process, each being its own virtual process; e.g. two processes and four VPs lead to two threads per process:

Process  Thread  VP
-------  ------  --
0        0       0
1        0       1
0        1       2
1        1       3
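
For illustration, a minimal sketch that reproduces the mapping in the table above. The formula `vp = thread * num_processes + rank` matches the table, but take it as an illustration of the round-robin layout, not as NEST's internal code:

# Sketch of the process/thread -> VP mapping shown in the table above.
# Assumption: VPs are assigned round-robin as vp = thread * num_processes + rank.
num_processes = 2
threads_per_process = 2

for thread in range(threads_per_process):
    for rank in range(num_processes):
        vp = thread * num_processes + rank
        print('process %d, thread %d -> VP %d' % (rank, thread, vp))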

Setting `total_num_virtual_procs` to eight would produce four threads per process, and so on. Your example above works even without mpi4py, like this:

mpiexec -n 2 python -c "\
   import nest; \
   nest.SetKernelStatus({'total_num_virtual_procs': 4}); \
   print('>>> this is process %d of %d with %d threads <<<' \
         % ( nest.Rank(), \
             nest.NumProcesses(), \
             nest.GetKernelStatus()['total_num_virtual_procs']//nest.NumProcesses()) \
   ); \
   nest.Simulate(10);"

Among its output are the following lines:

…

>>> this is process 1 of 2 with 2 threads <<<
>>> this is process 0 of 2 with 2 threads <<<
…

Sep 09 15:49:39 SimulationManager::start_updating_ [Info]: 
    Number of local nodes: 0
    Simulation time (ms): 10
    Number of OpenMP threads: 2
    Number of MPI processes: 2

Sep 09 15:49:39 SimulationManager::start_updating_ [Info]: 
    Number of local nodes: 0
    Simulation time (ms): 10
    Number of OpenMP threads: 2
    Number of MPI processes: 2

You can see that `total_num_virtual_procs` is split across all processes, such that "Number of OpenMP threads" times "Number of MPI processes" equals `total_num_virtual_procs`. Further, note that you don't see the thread parallelization here on the Python level, since the processes only enter a parallel context inside `Create()`, `Connect()` and `Simulate()` calls, in the C++ layer below.
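
If you want to read the split back from Python anyway, you can query the kernel status. A small sketch, assuming the standard kernel status keys `local_num_threads` and `num_processes`:

# run e.g. with:  mpiexec -n 2 python check_layout.py
import nest

nest.SetKernelStatus({'total_num_virtual_procs': 4})
status = nest.GetKernelStatus()
# each rank reports its own thread count; processes x threads = VPs
print('rank %d: %d process(es) x %d thread(s) = %d VP(s)'
      % (nest.Rank(),
         status['num_processes'],
         status['local_num_threads'],
         status['total_num_virtual_procs']))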

  • If you don't set `total_num_virtual_procs`, the default is one thread per process. You can see this by just creating a number of neurons, `nest.Create('iaf_psc_exp', 10)`, on, for example, two processes:

     Sep 09 16:28:28 SimulationManager::start_updating_ [Info]:
         Number of local nodes: 5
         Simulation time (ms): 10
         Number of OpenMP threads: 1
         Number of MPI processes: 2

     Sep 09 16:28:28 SimulationManager::start_updating_ [Info]:
         Number of local nodes: 5
         Simulation time (ms): 10
         Number of OpenMP threads: 1
         Number of MPI processes: 2

    Each process handles five of the ten created neurons. (`nest.GetKernelStatus('total_num_virtual_procs')` should return the number of processes in that case. Which NEST version are you using? This was already fixed…)

  • If the number of VPs you want to set is not a multiple of the number of MPI processes, NEST throws an exception (a small workaround sketch follows this list):

     nest.lib.hl_api_exceptions.BadProperty: ('BadProperty in SetKernelStatus: Number of virtual processes (threads*processes) must be an integer multiple of the number of processes. Value unchanged.', 'BadProperty', 'SetKernelStatus', ': Number of virtual processes (threads*processes) must be an integer multiple of the number of processes. Value unchanged.')
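
If you want to guard against that, here is a small sketch that rounds a requested VP count up to the next multiple of the process count before applying it. The helper name and the rounding policy are my own, not part of NEST:

import nest

def set_total_vps(requested):
    """Hypothetical helper: round `requested` up to a multiple of the
    MPI process count and apply it, so SetKernelStatus cannot fail on it."""
    procs = nest.NumProcesses()
    vps = -(-requested // procs) * procs  # ceiling to the next multiple of procs
    nest.SetKernelStatus({'total_num_virtual_procs': vps})
    return vps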

A generally good starting point when experimenting with different job geometries is one MPI process per NUMA domain (e.g. one process per physical CPU socket) and one thread per physical core (hyper-threading may cause contention for cache lines, which can even degrade performance).
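
For instance, on a hypothetical machine with two sockets and eight physical cores per socket (placeholder numbers, adjust to your hardware), that rule of thumb could look like the sketch below; `local_num_threads` is the kernel parameter that sets the number of threads per process:

# run with one MPI rank per socket, e.g.:  mpiexec -n 2 python layout.py
import nest

CORES_PER_SOCKET = 8  # placeholder: set to your machine's physical core count
nest.SetKernelStatus({'local_num_threads': CORES_PER_SOCKET})

# total_num_virtual_procs then equals num_processes * local_num_threads
print(nest.GetKernelStatus('total_num_virtual_procs'))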
