
Jupyter notebook: kernel restarts suddenly

I use JupyterLab and Jupyter Notebook for my deep learning programs, which involve long training runs. For the past few weeks, the kernel has repeatedly restarted after hours of training, which is very annoying. In addition, the server console and the browser log give very little information:

Jupyter-lab server log:

[I 2021-02-26 00:40:03.756 ServerApp] AsyncIOLoopKernelRestarter: restarting kernel (1/5), keep random ports
kernel 1330ee40-a826-44e2-9be9-f123deeaa1b2 restarted
[I 2021-02-26 00:40:04.070 ServerApp] Starting buffering for 1330ee40-a826-44e2-9be9-f123deeaa1b2:1b7fa111-f2d2-4804-bd90-c81e26562254
[I 2021-02-26 00:40:04.112 ServerApp] Restoring connection for 1330ee40-a826-44e2-9be9-f123deeaa1b2:1b7fa111-f2d2-4804-bd90-c81e26562254

I have the same problem when I use Jupyter-notebook instead of Jupyter-lab.

Various remarks:

  • The server and the client are not on the same machine, so I use ssh to connect to the server, as described here.
  • I work behind a corporate proxy.
  • I use TensorFlow 2 for deep learning.

If you want to be sure, you can run it with nohup (as a background process). Your Jupyter notebook will then keep running on the remote server even if you are disconnected from it.

This short tutorial shows how to run a notebook with nohup: https://gist.github.com/33eyes/e1da2d78979dc059433849c466ff5996
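For anyone who prefers to start the run from Python rather than the shell, here is a rough sketch of the same idea. It is not taken from the tutorial: launch_detached and nbconvert_command are names I made up, and start_new_session=True (POSIX only) is my stand-in for what nohup does, assuming the goal is just to detach the process from the ssh session.

```python
import subprocess
import sys


def nbconvert_command(notebook, output):
    """Build a command that executes a notebook and saves the executed copy."""
    return ["jupyter", "nbconvert", "--to", "notebook", "--execute",
            "--output", output, notebook]


def launch_detached(command, log_path):
    """Start `command` in its own session so closing the terminal does not stop it.

    Hypothetical helper: all output is appended to `log_path`.
    """
    with open(log_path, "ab") as log:
        proc = subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=log,
            stderr=subprocess.STDOUT,
            start_new_session=True,  # POSIX only: child gets a new session
        )
    # The child holds its own copy of the log file handle, so ours can close.
    return proc
```

For example, launch_detached(nbconvert_command("train.ipynb", "train_out.ipynb"), "train.log") would start the run and return immediately; the notebook names here are placeholders.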

OK, I think I found the cause of the error: it was most likely a small memory leak in the code I was running, which caused the program to crash after hundreds of epochs.
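One way to spot this kind of slow leak is to record memory use after every epoch and watch for steady growth. The sketch below is only an illustration, not the code from my training script: MemoryTracker is a made-up helper, and it relies on the standard tracemalloc module, which only sees allocations made through Python's allocator (so memory held natively by TensorFlow or on the GPU will not show up).

```python
import tracemalloc


class MemoryTracker:
    """Record Python-level memory after each epoch to spot steady growth."""

    def __init__(self):
        self.history = []  # bytes currently traced, one entry per record()
        if not tracemalloc.is_tracing():
            tracemalloc.start()

    def record(self):
        """Store and return the number of bytes currently traced."""
        current, _peak = tracemalloc.get_traced_memory()
        self.history.append(current)
        return current

    def is_growing(self, window=5):
        """True if memory rose at every one of the last `window` steps."""
        recent = self.history[-(window + 1):]
        if len(recent) < window + 1:
            return False
        return all(later > earlier
                   for earlier, later in zip(recent, recent[1:]))
```

Calling tracker.record() at the end of each epoch (for example from a Keras callback's on_epoch_end) and checking tracker.is_growing() every so often should make a leak visible long before the kernel dies.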

Somewhat related: I have also seen this issue when using CUDA deep learning libraries in Jupyter Notebooks created from Kubeflow:

AsyncIOLoopKernelRestarter: restarting kernel (1/5), keep random ports

In this case, it is due to a lack of memory assigned to the pod that Kubeflow creates to deploy the Jupyter server. The solution is as simple as allocating more memory (from ~4Gi) when creating the Notebook (and avoiding memory leaks in the code :) ).
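To see how much memory the pod actually got, you can read the cgroup limit from inside the notebook. This is a sketch under the assumption that the container exposes the usual cgroup files (v2 at /sys/fs/cgroup/memory.max, v1 at /sys/fs/cgroup/memory/memory.limit_in_bytes); the function names are mine, and on a machine without those files it simply returns None.

```python
from pathlib import Path

# Usual cgroup locations for the memory limit (v2 first, then v1).
CGROUP_LIMIT_FILES = (
    Path("/sys/fs/cgroup/memory.max"),
    Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),
)


def parse_limit(text):
    """Return the limit in bytes, or None when cgroup v2 reports 'max'."""
    text = text.strip()
    return None if text == "max" else int(text)


def container_memory_limit():
    """Return the memory limit in bytes from the first readable cgroup file.

    Returns None if no file can be read or the limit is reported as 'max'.
    """
    for path in CGROUP_LIMIT_FILES:
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue
    return None
```

If container_memory_limit() comes back close to the size of your model plus data, the restarts are probably the pod hitting its limit. Note that cgroup v1 reports a very large number rather than "max" when no limit is set.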

Not an answer, but some troubleshooting steps:

  • Run the notebook from the beginning, but only up to a certain cell.
  • Move that stopping point up or down until you find where it fails.
  • If it wasn't failing before, think about what you have added since.
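The steps above amount to a bisection over the cells. As a sketch only: first_failing_cell and the crashes_when_run_up_to predicate are hypothetical names, and the search assumes that if running the first k cells crashes, running more cells crashes too. In practice the predicate is you restarting the kernel and running up to cell k by hand.

```python
def first_failing_cell(n_cells, crashes_when_run_up_to):
    """Return the smallest k such that running cells 1..k crashes.

    Assumes running all n_cells crashes, and that a crash at k implies
    a crash for every larger k.
    """
    lo, hi = 1, n_cells
    while lo < hi:
        mid = (lo + hi) // 2
        if crashes_when_run_up_to(mid):
            hi = mid        # the culprit is at mid or earlier
        else:
            lo = mid + 1    # the culprit is after mid
    return lo
```

With 20 cells this takes at most five runs instead of twenty, which matters when each run ends in a kernel restart.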

In my case, I added

test_eq(dill.pickles(var),True) # Checks that it can be pickled

The dill.pickles call, whether inside or outside the test, kept causing the restart, though not necessarily while that line was running; it could happen a few lines later (weird, I know).

I checked the RAM usage and it looked as expected.

Commenting it out resolved the issue (I can't imagine why).

Now I test with

import dill
from fastcore.test import test_eq

pp = dill.loads(dill.dumps(al_es))  # round-trip the object through dill
test_eq(is_a_valid_object(pp), True)
# is_a_valid_object is my own function
# test_eq comes from fastcore.test
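If you suspect a particular call of killing the kernel, one option is to try it in a separate Python process first, so that a hard crash only takes down the child. This is a rough sketch and not something I used at the time: runs_without_crashing is a made-up helper, and the snippet passed to it must be self-contained (it cannot see variables from the notebook).

```python
import subprocess
import sys


def runs_without_crashing(code, timeout=60):
    """Run `code` in a fresh Python process.

    Returns (ok, stderr): ok is True if the child exited with status 0.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0, result.stderr
```

For example, runs_without_crashing("import dill; dill.pickles(list(range(10)))") checks a dill call in isolation, assuming dill is installed in the same environment.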
