Deadlock in nested loop MPI (Python mpi4py)
I can't figure out why this nested-loop MPI program never stops (i.e. it deadlocks). I know that most MPI users work in C++/C/Fortran and that I am using Python's mpi4py package here, but I suspect this is not a programming-language problem; rather, it is my misunderstanding of MPI itself.
Code
#!/usr/bin/env python3
# simple_mpi_run.py
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
root_ = 0
# Define some tags for MPI
TAG_BLOCK_IDX = 1
num_big_blocks = 5
for big_block_idx in np.arange(num_big_blocks):
    for worker_idx in (1+np.arange(size-1)):
        if rank==root_:
            # send to workers
            comm.send(big_block_idx,
                      dest = worker_idx,
                      tag = TAG_BLOCK_IDX)
            print("This is big block", big_block_idx,
                  "and sending to worker rank", worker_idx)
        else:
            # receive from root_
            local_block_idx = comm.recv(source=root_, tag=TAG_BLOCK_IDX)
            print("This is rank", rank, "on big block", local_block_idx)
Batch job script
The SGE batch job script that runs the above. For illustration purposes I allocate only three processes to mpirun with -np 3; in the real application I will use more than three.
#!/bin/bash
# batch_job.sh
#$ -S /bin/bash
#$ -pe mpi 3
#$ -cwd
#$ -e error.log
#$ -o stdout.log
#$ -R y
MPIPATH=/usr/lib64/openmpi/bin/
PYTHONPATH=$PYTHONPATH:/usr/local/lib/python3.6/site-packages/:/usr/bin/
export PYTHONPATH
PATH=$PATH:$MPIPATH
export PATH
LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/:/usr/lib64/
export LD_LIBRARY_PATH
mpirun -v -np 3 python3 simple_mpi_run.py
Output
In stdout.log, after running qsub batch_job.sh, I see the following output:
This is big block 0 and sending to worker rank 1
This is rank 1 on big block 0
This is big block 0 and sending to worker rank 2
This is big block 1 and sending to worker rank 1
This is rank 1 on big block 1
This is big block 1 and sending to worker rank 2
This is big block 2 and sending to worker rank 1
This is rank 1 on big block 2
This is big block 2 and sending to worker rank 2
This is big block 3 and sending to worker rank 1
This is rank 1 on big block 3
This is big block 3 and sending to worker rank 2
This is big block 4 and sending to worker rank 1
This is rank 1 on big block 4
This is big block 4 and sending to worker rank 2
This is rank 2 on big block 0
This is rank 2 on big block 1
This is rank 2 on big block 2
This is rank 2 on big block 3
This is rank 2 on big block 4
问题
据我所知,这是我预期的正确输出。 但是,当我运行qstat
,我可以看到作业状态保持在r
,表明该作业没有完成,即使我有所需的输出也是如此。 因此,我怀疑这是一个MPI死锁问题,但是尽管经过数小时的修补,我仍然看不到死锁问题。 任何帮助将非常感激!
Edit
Removed some comment blocks from the code that are not relevant to the deadlock problem at hand.
Answer
The root cause of the hang is that you swapped the second for loop and the if clause: a non-root rank should receive from the master only once per big block. As written, every worker also executes the inner worker loop, so with -np 3 each worker posts size-1 = 2 receives per big block (10 in total) while root sends it only one message per big block (5 in total); after draining its 5 messages, each worker blocks forever in comm.recv(), which is why the job never leaves the r state.
That being said, rather than reinventing the wheel you should use the MPI collective MPI_Bcast().
Here is a rewritten version of the program
#!/usr/bin/env python3
# simple_mpi_run.py
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
root_ = 0
# Define some tags for MPI
TAG_BLOCK_IDX = 1
num_big_blocks = 5
for big_block_idx in np.arange(num_big_blocks):
    if rank==root_:
        for worker_idx in (1+np.arange(size-1)):
            # send to workers
            comm.send(big_block_idx,
                      dest = worker_idx,
                      tag = TAG_BLOCK_IDX)
            print("This is big block", big_block_idx,
                  "and sending to worker rank", worker_idx)
    else:
        # receive from root_
        local_block_idx = comm.recv(source=root_, tag=TAG_BLOCK_IDX)
        print("This is rank", rank, "on big block", local_block_idx)
And here is a more MPI'ish version that uses MPI_Bcast()
#!/usr/bin/env python3
# simple_mpi_run.py
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
root_ = 0
num_big_blocks = 5
for big_block_idx in np.arange(num_big_blocks):
    local_block_idx = comm.bcast(big_block_idx, root=root_)
    if rank==root_:
        print("This is big block", big_block_idx,
              "and broadcasting to all worker ranks")
    else:
        print("This is rank", rank, "on big block", local_block_idx)