
Jupyter: The kernel appears to have died. It will restart automatically. (Keras Related)

I'm trying to train a ResNet50, but no matter what I do, the Jupyter notebook kernel dies ("The kernel appears to have died. It will restart automatically.") the moment training starts (Epoch 1/100). I have a GeForce GTX 1060 Ti. When I run nvidia-smi during training (which lasts only about 1 second), I see just 80 MB of GPU memory allocated, far less than in earlier runs, and then the kernel dies, as if it tries to start and fails.

Here are the requirements:


which I satisfy. Here is how I set up the training session:

config = tf.ConfigProto()
config.gpu_options.allow_growth = False
config.gpu_options.per_process_gpu_memory_fraction = 0.9
sess = tf.Session(config=config) 

os.environ["CUDA_VISIBLE_DEVICES"] = '0' #yes, this is the ID of my GPU.

# create the FCN model
model_mobilenet = ResNet50(input_shape=(1024, 1024, 3), include_top=False) # use the Resnet
model_x8_output = Conv2D(128, (1, 1), activation='relu')(model_mobilenet.layers[-95].output)
model_x8_output = UpSampling2D(size=(8, 8))(model_x8_output)
model_x8_output = Conv2D(3, (3, 3), padding='same', activation='sigmoid')(model_x8_output)
MODEL_x8 = Model(inputs=model_mobilenet.input, outputs=model_x8_output)

MODEL_x8.compile(loss='binary_crossentropy', optimizer=Adam(lr=1e-3), metrics=[jaccard_distance])

MODEL_x8.fit_generator(train_generator, steps_per_epoch=300, epochs=100, verbose=1, validation_data=val_generator, validation_steps=10)

Epoch 1/100
  1/300 [..............................] - ETA: 1:01:59 - loss: 0.7193 - jaccard_distance: 0.1125

I have tried:

  • setting config.gpu_options.allow_growth to True
  • setting config.gpu_options.per_process_gpu_memory_fraction to other arbitrary values, such as 0.1
  • commenting out the line: #os.environ["CUDA_VISIBLE_DEVICES"] = '0'

None of these worked. I would appreciate constructive answers.

Thanks in advance.

EDIT: I have now tried running this as a script (not as a notebook). As soon as the TensorFlow session line runs, the terminal prints the following:

2020-01-28 13:44:55.756819: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757047: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757313: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757526: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757736: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.757940: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Could not dlopen library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/username/ros_ws/devel/lib:/opt/ros/melodic/lib
2020-01-28 13:44:55.808416: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
2020-01-28 13:44:55.808444: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1663] Cannot dlopen some GPU libraries. Skipping registering GPU devices...

This is strange because I have CUDA 9.0, not CUDA 10, so these libraries should not even be requested. Is my TensorFlow version wrong?
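The check that the log reports can be repeated outside TensorFlow. Below is a minimal sketch, assuming Linux and Python's standard ctypes module; the library names are copied from the log, while the helper names can_dlopen and report are made up for this illustration. It asks the dynamic loader for each library, so a "MISSING" line means the loader cannot find that file on its default search path or on LD_LIBRARY_PATH.

```python
import ctypes

# Library names copied from the log above. The ".10.0" suffix is what the
# installed TensorFlow build asks for, not necessarily what is on the machine.
CUDA_LIBS = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]


def can_dlopen(name):
    """Return True if the dynamic loader can find and open `name` (hypothetical helper)."""
    try:
        ctypes.CDLL(name)
        return True
    except OSError:
        return False


def report(names=CUDA_LIBS):
    """Map each library name to whether it could be opened (hypothetical helper)."""
    return {name: can_dlopen(name) for name in names}


if __name__ == "__main__":
    for name, ok in report().items():
        print(("found   " if ok else "MISSING ") + name)
```

If the 10.0 libraries are missing, as they would be on a machine with only CUDA 9.0 installed, the installed TensorFlow build was compiled against a different CUDA version than the one present.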

Most likely this is because there is not enough memory to hold the data and the model. Your input image size is 1024x1024, which is large. I would suggest training with a smaller image size, such as 256 or even 128, just to see whether it works at all.
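To get a feel for how much the input size matters, here is a rough back-of-the-envelope sketch. It counts only the float32 output of ResNet50's first convolution (64 filters, stride 2) for a single image, so it is a lower bound for illustration and not a full memory estimate; the function name is made up for this example.

```python
def conv1_activation_bytes(side, batch=1, channels=64, bytes_per_value=4):
    """Bytes needed for the output of ResNet50's first conv layer (hypothetical helper).

    Assumes a square input of `side` x `side` pixels and float32 activations.
    """
    out_side = side // 2  # the stride-2 convolution halves each spatial dimension
    return batch * out_side * out_side * channels * bytes_per_value


for side in (1024, 256, 128):
    mib = conv1_activation_bytes(side) / 2**20
    print(f"{side}x{side}: {mib:.0f} MiB per image")
# 1024x1024: 64 MiB per image
# 256x256: 4 MiB per image
# 128x128: 1 MiB per image
```

Every activation in the network scales the same way with the input area, so going from 1024 to 256 cuts activation memory by a factor of 16.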

Also, is your GPU being detected by TF?
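One way to check this is sketched below. It assumes the device_lib module under tensorflow.python.client, which is not part of the documented public API but is commonly used for this purpose; the function name visible_gpus is made up for this example.

```python
def visible_gpus():
    """Return the GPU device names TensorFlow registered (hypothetical helper).

    Returns None if TensorFlow cannot be imported at all.
    """
    try:
        from tensorflow.python.client import device_lib
    except ImportError:
        return None
    # Each entry describes one device; keep only those of type "GPU".
    return [d.name for d in device_lib.list_local_devices() if d.device_type == "GPU"]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("TensorFlow is not importable in this environment")
    elif not gpus:
        print("TensorFlow is running, but no GPU was registered")
    else:
        print("GPUs seen by TensorFlow:", gpus)
```

An empty list here means TensorFlow has fallen back to the CPU, which is what the "Skipping registering GPU devices..." line in a log indicates.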

Okay, got it.

The problem was that my tensorflow-gpu version (1.14) was not compatible with my CUDA version (9.0). I had to install a version lower than 1.13. But that was not the only catch: my cuDNN version (705) was also a problem, so I had to downgrade tensorflow-gpu all the way to 1.9.0.
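The mismatch can be written down as a small lookup. This is only a sketch: the CUDA versions below are as I recall them from TensorFlow's "tested build configurations" table for Linux GPU builds, so please verify them against that table for your platform. The names TESTED_CUDA and cuda_matches are made up for this example.

```python
# tensorflow-gpu version -> CUDA version it was built and tested against.
# Assumption: values recalled from TensorFlow's tested build configurations
# table; check the official table before relying on them.
TESTED_CUDA = {
    "1.9.0": "9.0",
    "1.12.0": "9.0",
    "1.13.1": "10.0",
    "1.14.0": "10.0",
}


def cuda_matches(tf_version, installed_cuda):
    """Return True if `installed_cuda` is the CUDA version listed for `tf_version`."""
    expected = TESTED_CUDA.get(tf_version)
    if expected is None:
        raise KeyError(f"no entry for tensorflow-gpu {tf_version}")
    return expected == installed_cuda


print(cuda_matches("1.14.0", "9.0"))  # False: 1.14 looks for the CUDA 10.0 libraries
print(cuda_matches("1.9.0", "9.0"))   # True
```

This matches the log in the question: tensorflow-gpu 1.14 looks for libcudart.so.10.0 and the other 10.0 libraries, which do not exist on a CUDA 9.0 installation.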

Now everything works.

In my case (Windows 10, RTX 3050 Ti GPU with 4 GB of VRAM), the "The kernel appears to have died" error was resolved by uninstalling CUDA 11 (and its matching cuDNN) and installing CUDA 10.1 (and cuDNN 2.2.0), and by uninstalling tensorflow-gpu 2.3.0 and installing tensorflow-gpu 2.2.0. Python 3.8 worked for me even though the TensorFlow website lists Python 3.5 as tested, so I did not downgrade Python. However, I am not satisfied with the result, because my GPU takes longer to train models than my Intel Core i7 CPU.

In short, this error seems to be related to an incompatibility between the GPU and the CUDA version, which can be fixed by downgrading CUDA and installing the matching versions of the other components (cuDNN, tensorflow-gpu) for the new CUDA version.
