
Getting Fatal error when calling MPI_Reduce inside a loop

I have a problem in this part of my code (which all ranks execute):

for (i = 0; i < m; i++) {
    // some code
    MPI_Reduce(&res, &mn, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
    // some code
}

This is working fine, but for large values of m I get this error:

    Fatal error in PMPI_Reduce: Other MPI error, error stack:
    PMPI_Reduce(1198).........................: MPI_Reduce(sbuf=008FFC80, rbuf=008FFC8C, count=1, MPI_INT, MPI_MIN, root=0, MPI_COMM_WORLD) failed
    MPIR_Reduce(764)..........................:
    MPIR_Reduce_binomial(207).................:
    MPIC_Send(41).............................:
    MPIC_Wait(513)............................:
    MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
    MPIDI_CH3I_Progress_handle_sock_event(436):
    MPIDI_CH3_PktHandler_EagerShortSend(306)..: Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.
    
    job aborted:
    rank: node: exit code[: error message]
    0: AmirDiab: 1
    1: AmirDiab: 1
    2: AmirDiab: 1: Fatal error in PMPI_Reduce: Other MPI error, error stack:
    PMPI_Reduce(1198).........................: MPI_Reduce(sbuf=008FFC80, rbuf=008FFC8C, count=1, MPI_INT, MPI_MIN, root=0, MPI_COMM_WORLD) failed
    MPIR_Reduce(764)..........................:
    MPIR_Reduce_binomial(207).................:
    MPIC_Send(41).............................:
    MPIC_Wait(513)............................:
    MPIDI_CH3i_Progress_wait(215).............: an error occurred while handling an event returned by MPIDU_Sock_Wait()
    MPIDI_CH3I_Progress_handle_sock_event(436):
    MPIDI_CH3_PktHandler_EagerShortSend(306)..: Failed to allocate memory for an unexpected message. 261895 unexpected messages queued.
    3: AmirDiab: 1

Any advice?

You seem to be overtaxing MPI with your communication pattern. Note the "261895 unexpected messages queued" part of the error: that is a lot of messages. Because MPI sends small messages (such as your single-element reductions) eagerly, issuing hundreds of thousands of MPI_Reduce calls in a loop can exhaust internal buffers once too many messages are in flight at the same time.

If possible, rearrange your algorithm so that all m elements are handled in a single reduction instead of one reduction per iteration:

#include <stdlib.h>  /* malloc, free */

int* ms  = malloc(m * sizeof(int));  /* per-rank input values */
int* res = malloc(m * sizeof(int));  /* element-wise minima, significant on root only */

for (i = 0; i < m; ++i) {
    ms[i] = /* ... */;
}

/* One collective over all m elements replaces m single-element calls. */
MPI_Reduce(ms, res, m, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);

free(ms);
free(res);
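With this pattern each rank takes part in a single collective, so the number of in-flight messages no longer scales with m. Note that only the root's res array holds the reduced minima afterwards; on the other ranks it is insignificant.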

Alternatively, as suggested in the comments, you can call MPI_Barrier(MPI_COMM_WORLD) every so often inside the loop to limit the number of outstanding messages.
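A minimal sketch of that approach, applied to the original loop; the batch size of 1024 is an illustrative value you would tune for your application:

for (i = 0; i < m; i++) {
    // some code
    MPI_Reduce(&res, &mn, 1, MPI_INT, MPI_MIN, 0, MPI_COMM_WORLD);
    // some code

    /* Synchronize periodically so ranks cannot run far ahead of each
       other and the unexpected-message queue stays bounded. */
    if ((i + 1) % 1024 == 0)
        MPI_Barrier(MPI_COMM_WORLD);
}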
