
"Kernel appears to have died" error when I try to train my model. Is it too big? What could be the issue?

None of the other solutions to this question on here have worked for me.

I am trying to train a model in a Jupyter notebook on an Amazon SageMaker ml.c4.8xlarge instance with 15,000 GB of memory. However, this error keeps coming up when I run training. The model is as follows:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GRU, Dense

# vocab_size and max_length are defined earlier in the notebook
model = Sequential()
model.add(Embedding(vocab_size, 200, input_length=max_length))
model.add(GRU(units=400, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(800, activation='sigmoid'))
model.add(Dense(400, activation='sigmoid'))
model.add(Dense(200, activation='sigmoid'))
model.add(Dense(100, activation='sigmoid'))
model.add(Dense(50, activation='sigmoid'))
model.add(Dense(20, activation='sigmoid'))
model.add(Dense(10, activation='sigmoid'))
model.add(Dense(5, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))

With the following summary: [model summary screenshot]

Is the model too big? Do I not have enough memory to run it, or could it be some other issue?

If you have a CPU-only host, then you should take into account not only the size of the model but also the amount of RAM occupied by your data and by all the variables in the Jupyter notebook. As you probably know, these variables stay in memory until the kernel is restarted.
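For the model itself, a quick sanity check is to turn the parameter count from model.summary() into bytes. A rough sketch, counting float32 weights only (the optimizer state, gradients, and activations used during training take several times more):

# Rough size of the weights alone: float32 = 4 bytes per parameter.
# Optimizer state, gradients, and activations add several times more on top.
n_params = model.count_params()
print(f"{n_params:,} parameters ~ {n_params * 4 / 1024 ** 2:.1f} MB of weights")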

For example, if you load your dataset like this:

import numpy as np
from sklearn.model_selection import train_test_split

data = load_train_data(path)  # placeholder loader returning a DataFrame
index = np.arange(len(data))
trn_idx, val_idx = train_test_split(index, test_size=0.2)
# here we make copies: .loc with an index array returns new DataFrames, not views
trn_df, val_df = data.loc[trn_idx], data.loc[val_idx]

Then all these variables occupy some space in your RAM. You can try to free some memory using del and explicit garbage collector calls.

import gc

del data        # drop the last reference to the big DataFrame
gc.collect()    # ask the garbage collector to release the memory right away

This example is made up, but I guess you get the idea. You can also monitor your memory usage with the free command:

$ watch -n 0.1 free -mh
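If you prefer to watch this from inside the notebook, psutil reports the same numbers as free (assuming psutil is installed in your environment, which it usually is on SageMaker images):

import psutil

# Same numbers that `free` prints, as seen from inside the kernel.
mem = psutil.virtual_memory()
print(f"total: {mem.total / 1024 ** 3:.1f} GiB, "
      f"available: {mem.available / 1024 ** 3:.1f} GiB, "
      f"used: {mem.percent:.0f}%")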

Then you can step through your notebook to see when memory usage goes over the limit. Generally speaking, huge datasets and (probably unintentional) copies of the data can easily occupy dozens of gigabytes of RAM.
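To find those copies, pandas can report how much each DataFrame holds. A sketch based on the made-up example above, assuming data, trn_df, and val_df are pandas DataFrames:

def df_size_mb(df):
    # deep=True also counts the Python strings inside object columns
    return df.memory_usage(deep=True).sum() / 1024 ** 2

for name, df in [("data", data), ("trn_df", trn_df), ("val_df", val_df)]:
    print(f"{name}: {df_size_mb(df):.1f} MB")
# trn_df and val_df together are roughly another full copy of data,
# so this split alone can double your memory usage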


Even if you have a GPU in your machine, the data has to be loaded into RAM before it can be moved to GPU memory, so you always need to keep track of how much memory is still available.
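On a GPU instance you can track the device side in the same way, for example by calling nvidia-smi from the notebook. A sketch, assuming the NVIDIA tools are on the PATH (they are on GPU instances):

import subprocess

# Used vs. total GPU memory, one line per GPU, e.g. "1234 MiB, 16160 MiB"
out = subprocess.run(
    ["nvidia-smi", "--query-gpu=memory.used,memory.total", "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
print(out.stdout.strip())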

You could also check this package, which helps automate garbage collection a bit. AFAIK, it supports PyTorch only but was developed with other backends in mind. You can probably adapt its ideas for your own needs.

I had the same issue, and eventually using an EC2 instance with more memory solved it.
