简体   繁体   中英

Open MPI Virtual Timer Expired

I'm using Open MPI 1.8 on Gentoo 3.13 to manage the data transfer from one program to another via a server/client concept. Both the server and the clients are launched via mpiexec as separate processes. After some days (this is quite a heavy computation...), I sometimes receive the error

mpiexec noticed that process rank 0 with PID 17213 on node XXX exited on signal 26 (Virtual timer expired).

Unfortunately, the error is not reproducible in a reliable way, ie, the error does not appear always and not always at the same point in the program flow. I also experienced this error on other machines. I already tracked the issue down to the ITIMER_VIRTUAL which, upon expiration, delivers SIGVTALRM (see, eg, http://man7.org/linux/man-pages/man2/setitimer.2.html ). In the BUGS section of the man page, it says that

Under very heavy loading, an ITIMER_REAL timer may expire before the signal from a previous expiration has been delivered. The second signal in such an event will be lost.

I wonder if something similar might also hold for ITIMER_VIRTUAL ? Did anyone experience similar problems and can confirm the error?

The only workaround I can think of is to invoke setitimer(...) and try to manipulate the timer myself. However, I hope there is another way since I can't always modify the clients' source code. Any suggestions?

Since this question has not been answered officially, I will do it on behalf of Hristo (@HristoIliev: I hope this is ok for you). As was pointed out in the first comment to my question, there is not a single hint in the Open MPI source code which can have caused the virtual timer expiration. Indeed, the timer problem was related to a third-party library which made the code crash after an unpredictable time (depending on the current loading of the machine).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM