I'm using Open MPI 1.8 on Gentoo 3.13 to manage the data transfer from one program to another via a server/client concept. Both the server and the clients are launched via mpiexec
as separate processes. After some days (this is quite a heavy computation...), I sometimes receive the error
mpiexec noticed that process rank 0 with PID 17213 on node XXX exited on signal 26 (Virtual timer expired).
Unfortunately, the error is not reproducible in a reliable way, ie, the error does not appear always and not always at the same point in the program flow. I also experienced this error on other machines. I already tracked the issue down to the ITIMER_VIRTUAL
which, upon expiration, delivers SIGVTALRM
(see, eg, http://man7.org/linux/man-pages/man2/setitimer.2.html ). In the BUGS section of the man page, it says that
Under very heavy loading, an ITIMER_REAL timer may expire before the signal from a previous expiration has been delivered. The second signal in such an event will be lost.
I wonder if something similar might also hold for ITIMER_VIRTUAL
? Did anyone experience similar problems and can confirm the error?
The only workaround I can think of is to invoke setitimer(...)
and try to manipulate the timer myself. However, I hope there is another way since I can't always modify the clients' source code. Any suggestions?
Since this question has not been answered officially, I will do it on behalf of Hristo (@HristoIliev: I hope this is ok for you). As was pointed out in the first comment to my question, there is not a single hint in the Open MPI source code which can have caused the virtual timer expiration. Indeed, the timer problem was related to a third-party library which made the code crash after an unpredictable time (depending on the current loading of the machine).
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.