
Parallel Python-C++ program freezes (memory?)

I have a Python program with a C++ core wrapped for Python. It is parallelized because it is computationally very expensive, and I am currently running it on a remote server under Ubuntu 16.04.

The problem is that, after a certain number of cycles (say 2000 for my test case), it freezes abruptly without any error message. I located the part of the code where it stops: a Python function that contains no for loop, so I assume it is not stuck in a loop. Since that function only does minor calculations, I tried commenting it out. Now, at exactly the same number of cycles, the program gets stuck a little further on, this time inside the C++ part. I am starting to suspect a memory problem on the server.

Running htop in a terminal while the code is stuck, I can see that the cores involved in the computation are fully loaded, as if they were still busy with some unknown calculation. Moreover, the memory used by the process (at least once it is already stuck) is not fully occupied, so it may not be a RAM problem either.

I also tried drastically reducing the amount of output written at every cycle (which, I admit, was considerable in size), but nothing changed. With the optimal number of processors it takes about 20 minutes to reach the critical point of 2000 cycles, so the problem is not easy to reproduce.

This is the first time I have run into this sort of problem. Is there anything else I can do to pin down the issue?

Thanks in advance for any answers.

Here is something you could try. Write code that tracks which iteration is running and stores all the variables at the start of the 2000th iteration.

Then use the same set of variables to run that iteration again. It won't solve your problem, but it will reduce the testing time, and consequently the time it takes you to find the problem.
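As a rough sketch of that idea, assuming the whole simulation state can be held in one picklable Python object: the names `run`, `resume`, `step` and `checkpoint.pkl` are made up for illustration and are not from your code. If part of the state lives inside the C++ core, it would need its own way of being written out and read back.

```python
import pickle


def run(step, state, n_cycles, start=0, checkpoint_at=None,
        checkpoint_path="checkpoint.pkl"):
    """Run cycles start..n_cycles-1; save the state just before `checkpoint_at`."""
    for i in range(start, n_cycles):
        if i == checkpoint_at:
            # Store everything needed to restart exactly at this cycle.
            with open(checkpoint_path, "wb") as f:
                pickle.dump({"cycle": i, "state": state}, f)
        state = step(i, state)  # hypothetical: one cycle of the real computation
    return state


def resume(step, n_cycles, checkpoint_path="checkpoint.pkl"):
    """Reload the saved state and continue from the saved cycle."""
    with open(checkpoint_path, "rb") as f:
        saved = pickle.load(f)
    return run(step, saved["state"], n_cycles, start=saved["cycle"])
```

You would run once with `checkpoint_at=2000`, then call `resume(...)` in a fresh process each time you want to test the critical cycle without waiting 20 minutes.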

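If it is definitely a memory issue, then after restarting from the state saved at iteration 2000 the code will not get stuck at 2000 (that is where you start) but at 4000, because the memory that built up over the first 2000 iterations is gone after the restart.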

Then you can monitor memory at the 2000th iteration and replicate that.
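One possible way to do the monitoring from inside the program, again only a sketch: it uses the standard `resource` module (Unix only), whose `ru_maxrss` field is the peak resident set size of the process, reported in kilobytes on Linux. Since it is measured for the whole process, memory allocated by the C++ core is included. `run_with_memory_log` and `step` are hypothetical names.

```python
import resource


def peak_rss():
    """Peak resident set size of this process (kilobytes on Linux)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def run_with_memory_log(step, state, n_cycles, every=100, log=print):
    """Run the cycles and record peak memory every `every` cycles."""
    history = []
    for i in range(n_cycles):
        state = step(i, state)  # hypothetical: one cycle of the real computation
        if i % every == 0:
            rss = peak_rss()
            history.append((i, rss))
            log(f"cycle {i}: peak RSS {rss}")
    return state, history
```

If the logged value keeps climbing all the way to the freeze, that points towards a leak. If it stays flat, memory is probably not the cause. In a parallel run each process would have to log its own value.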

