I use JupyterLab and Jupyter Notebook for my deep learning programs, so I do long runs to train my models. For some weeks now, the kernel has repeatedly restarted after hours of training, which is very annoying. In addition, very little information is given by the server console or the browser log:
Jupyter-lab server log:
[I 2021-02-26 00:40:03.756 ServerApp] AsyncIOLoopKernelRestarter: restarting kernel (1/5), keep random ports
kernel 1330ee40-a826-44e2-9be9-f123deeaa1b2 restarted
[I 2021-02-26 00:40:04.070 ServerApp] Starting buffering for 1330ee40-a826-44e2-9be9-f123deeaa1b2:1b7fa111-f2d2-4804-bd90-c81e26562254
[I 2021-02-26 00:40:04.112 ServerApp] Restoring connection for 1330ee40-a826-44e2-9be9-f123deeaa1b2:1b7fa111-f2d2-4804-bd90-c81e26562254
I have the same problem when I use Jupyter-notebook instead of Jupyter-lab.
Various remarks:
If you want to be sure, you can run it in nohup mode (as a background process). It will keep running your Jupyter notebook script on the remote server even if you are disconnected from it.
You can run in nohup mode by following this small tutorial: https://gist.github.com/33eyes/e1da2d78979dc059433849c466ff5996
OK, I think I found the cause of the error: it was almost certainly a small memory leak in the code I was running, which crashed the program after hundreds of epochs.
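One way to check for this kind of slow leak is to log memory use once per epoch and see whether it keeps climbing. The following is only a minimal sketch using the standard-library tracemalloc module; train_one_epoch and the leak list are hypothetical stand-ins for a real training loop, and tracemalloc only sees allocations made through Python's allocators, so GPU memory or memory held by native libraries may not show up here.

```python
import tracemalloc


def train_one_epoch(leak):
    # Hypothetical stand-in for a real training step. It deliberately keeps
    # a reference to new data on every call to simulate a small leak.
    leak.append([0.0] * 10_000)


def run(epochs=5):
    tracemalloc.start()
    leak = []
    usage = []
    for epoch in range(epochs):
        train_one_epoch(leak)
        # get_traced_memory() returns (current, peak) in bytes.
        current, peak = tracemalloc.get_traced_memory()
        usage.append(current)
        print(f"epoch {epoch}: current={current / 1e6:.2f} MB, "
              f"peak={peak / 1e6:.2f} MB")
    tracemalloc.stop()
    return usage


usage = run()
```

If the "current" figure grows steadily from epoch to epoch instead of levelling off, something is probably holding references that should have been released.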
Somewhat related: I have also seen this issue when using CUDA-DL libraries in Jupyter Notebooks created from Kubeflow:
AsyncIOLoopKernelRestarter: restarting kernel (1/5), keep random ports
In this case, it is due to a lack of memory assigned to the pod Kubeflow creates to deploy the Jupyter Server. The solution is as simple as allocating more memory (from ~4Gi) when creating the Notebook (and avoiding memory leaks in code:) )
Not an answer, but some troubleshooting notes:
In my case, I added
test_eq(dill.pickles(var),True) # Checks that it can be pickled
The dill.pickles call, whether inside or outside the test, kept causing the restart, though not necessarily when that line ran; it could happen a few unrelated lines later (weird, I know).
I checked the RAM usage and it looked as expected.
Commenting it out resolved the issue. (Can't imagine why).
Now I test with
pp=dill.loads(dill.dumps(al_es))
test_eq(is_a_valid_object(pp),True)
# is_a_valid_object is my own function
# test_eq is from fastcore.test.test_eq