Python Keras code out of memory for no apparent reason

Consider the following code that works with a Keras Sequential model on the CIFAR-10 data set. Background is given at the end of the post:

import numpy as np
import tensorflow as tf
from sklearn.datasets import fetch_openml
from sklearn.utils import shuffle

data, targets = shuffle(*fetch_openml('CIFAR_10', version=1, return_X_y=True))
train_sz = 50000
X_train, X_test = data[:train_sz, :], data[train_sz:, :]
y_train = np.asarray(targets[:train_sz], dtype=int)
y_test = np.asarray(targets[train_sz:], dtype=int)

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(X_train.shape[1],)))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(10))
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), optimizer='adam')

s = 0
for _ in range(500):
    for i in range(100):
        layers = []
        for layer in model.get_weights():
            layers.append(np.random.normal(0, 1, layer.shape))
        model.set_weights(layers)
        eval = model.evaluate(X_train, y_train)
        s += eval
        print(f'Done {i}')
print(s)

After about one iteration of the outer for loop (sometimes a little earlier, sometimes a little later), Python crashes with exit code 137, which, as far as I know, usually means out of memory. I have 16 GB of memory on my system, of which about 20% is in use before I run this. Once it is running, memory usage climbs steadily to about 80%-90%, drops to 60%-70% (GC kicking in?), then climbs again; this repeats 2-3 times until the process finally crashes.

I'm on a headless Ubuntu 18.04 Server machine, with Python 3.7 in Anaconda, on Tensorflow 2.2 with a Titan X GTX GPU that is not being used for anything else (so about 11GB of memory free there).

My calculations (very pessimistic, to be sure):

  1. I have about 12 GB free when I run this.
  2. Storing the data uses 60000*32*32*3 floating point numbers, which is about 1500 MB for float64s. Let's put down 6 GB here because of all the copies I'm making. Regardless, it looks like this is what uses the most memory.
  3. The layer sizes are negligible at this point: X_train.shape[1] is 3072 (32*32*3), and 64 hidden units is nothing.
  4. model.evaluate has a default batch size of 32, so inside it, it should use about 32*32*32*3*64 float64s for the output of the middle layer. That's 50 MB, let's put in 1 GB here just to be sure again.
  5. model.evaluate probably also needs to store the predictions, so that's 50000*10 float64s, which is another 4 MB. Let's put in another 1 GB here for good measure.

Total: 6 + 1 + 1 = 8 GB. My memory usage should absolutely not exceed 80%, and I have overestimated the calculations by a lot.
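
As a sanity check, the raw figures in the list above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the helper name size_mib is made up for illustration, and it measures in MiB (1024**2 bytes), so the results come out slightly below the rounded MB values quoted above.

```python
# Back-of-the-envelope check of the estimates above (a float64 is 8 bytes).
BYTES_PER_FLOAT64 = 8
MIB = 1024 ** 2


def size_mib(num_values, bytes_per_value=BYTES_PER_FLOAT64):
    """Return the size in MiB of num_values values (hypothetical helper)."""
    return num_values * bytes_per_value / MIB


dataset_mib = size_mib(60000 * 32 * 32 * 3)      # item 2: the whole data set
hidden_mib = size_mib(32 * 32 * 32 * 3 * 64)     # item 4: pessimistic per-batch figure
predictions_mib = size_mib(50000 * 10)           # item 5: predictions for the training set

print(f"data set:    {dataset_mib:8.2f} MiB")
print(f"hidden:      {hidden_mib:8.2f} MiB")
print(f"predictions: {predictions_mib:8.2f} MiB")
```

Even before the generous padding to 6 GB, 1 GB and 1 GB, the three raw figures together stay below 1.5 GiB.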

Why is so much memory being used and can I optimize anything in how I manage the data?

I've tried casting the X's to np.int with np.asarray, since there's no need for float64s there, but that just makes it crash much faster. It's as if it keeps both the float64s and the ints in memory.

Background

I'm working on a genetic algorithm that trains artificial neural networks. I've traced the crash to the computation of the fitnesses, which involves applying the trained weights stored in each individual to the neural network and evaluating the network (inner i loop, I have 100 individuals in my population). This is repeated for each generation (outermost for loop). A bit more memory is used there, but still very little.

That is why there is no fitting going on here, the weights are determined by my genetic algorithm and applied to the network.

This reduced code reproduces the issue.

I was able to replicate your issue in a notebook on a Google Colab GPU instance with 12 GB of memory. After 6 iterations memory usage had spiked to ~2.5 GB, it reached ~6 GB at 50 iterations, and then the kernel died.

Just by calling the garbage collector in every inner-loop iteration, memory stabilized at ~1 GB and I was able to get past 2 outer iterations (then I cancelled it).

My suspicion is that TensorFlow creates references in each iteration faster than the garbage collector reclaims them by default.

import gc
import tensorflow as tf
from sklearn.datasets import fetch_openml
from sklearn.utils import shuffle
import numpy as np

data, targets = shuffle(*fetch_openml('CIFAR_10', version=1, return_X_y=True))

train_sz = 50000
X_train, X_test = data[:train_sz, :], data[train_sz:, :]
y_train = np.asarray(targets[:train_sz], dtype=int)
y_test = np.asarray(targets[train_sz:], dtype=int)

model = tf.keras.Sequential()
model.add(tf.keras.Input(shape=(X_train.shape[1],)))
model.add(tf.keras.layers.Dense(64, activation='relu'))
model.add(tf.keras.layers.Dense(10))
model.compile(loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True), optimizer='adam')

s = 0
for _ in range(500):
    for i in range(100):
        gc.collect()
        layers = []
        for layer in model.get_weights():
            layers.append(np.random.normal(0, 1, layer.shape))
        model.set_weights(layers)
        eval = model.evaluate(X_train, y_train)
        s += eval
        print(f'Done {i} eval {eval}')
s

PS: calling the garbage collector manually is generally something you want to avoid, but sometimes it gets the job done.
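
To illustrate the general mechanism I have in mind (not what TensorFlow does internally, which I have not verified), here is a minimal sketch in plain Python. Objects caught in a reference cycle are not freed by reference counting alone; they wait for the cycle collector, and an explicit gc.collect() reclaims them right away. The Node class is a toy made up for the demo.

```python
import gc


class Node:
    """Toy object that references itself, forming a reference cycle."""

    def __init__(self):
        self.me = self


gc.disable()              # pause automatic cycle collection for the demo
gc.collect()              # start from a clean slate

for _ in range(1000):
    Node()                # each instance immediately becomes cyclic garbage

freed = gc.collect()      # an explicit pass finds and reclaims the cycles
gc.enable()

print(f"unreachable objects found by gc.collect(): {freed}")
```

If a library builds many such cycles per call, memory can pile up between automatic collections, which would be consistent with the sawtooth pattern described in the question.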

Firstly, this code works and keeps running without crashing on my Ubuntu 18.04 headless server (16 GB RAM) with python-3.6 and tensorflow-1.14. I have an RTX 2080 Ti GPU with 11 GB of memory. I don't use Anaconda. Here is the training log from the provided reproducible code.

[image: training_log]

I noticed that GPU memory usage is high (10 GB, as seen in the snapshot below); however, this is quite normal, and a GPU with this much memory should handle such a load comfortably.

[image: gpu_usage]

Then again, your GPU, a Titan X GTX, is as powerful as the RTX 2080 Ti and has an extra 1 GB of memory, so it should not find this deep learning model "too large" unless some other process is limiting GPU/CPU usage.

A few things to try

  1. Keras model.evaluate() accepts batch_size as an argument, so try reducing it.
  2. Your code changes the weights inside a loop, which may be CPU intensive, so I suspect this issue is mostly due to CPU memory overflow. Check the CPU usage with the top command and ask yourself: "is handling weights in a loop absolutely necessary?"
  3. I am not sure whether this is relevant, but I saw a recommendation here that allocating more memory to Docker solved a similar issue.
  4. If nothing works and you're ready to take the gamble of installing tensorflow-1.14 and running the code in a terminal instead of a notebook, that's an option too. You never know!
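
Besides watching top, the script can log its own memory footprint from inside the loop. Below is a minimal sketch using only the standard library. It assumes a Unix-like system (the resource module is not available on Windows), and peak_rss_mb is a hypothetical helper name. Note that it reports the peak resident set size, so the value never decreases.

```python
import resource
import sys


def peak_rss_mb():
    """Peak resident set size of this process, in MB (hypothetical helper).

    ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
buffer = bytes(range(256)) * (200 * 1024)   # about 50 MB of real data between readings
after = peak_rss_mb()

print(f"peak RSS: {before:.1f} MB -> {after:.1f} MB")
```

Calling such a helper once per inner iteration, next to the existing print, would show whether the footprint keeps growing with every model.evaluate() call.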

Good luck.

Exit code 137 means that your process was killed by SIGKILL (signal 9). If you did not stop the script manually and still got this exit code, then the script was killed by your OS. In most cases this is caused by excessive memory usage.
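
For reference, the number 137 comes from the shell convention of reporting a process that was killed by signal N as exit code 128 + N. A small sketch of the decoding, assuming a POSIX system where the signal module defines SIGKILL:

```python
import signal

exit_code = 137
signal_number = exit_code - 128                   # shells report "killed by signal N" as 128 + N
signal_name = signal.Signals(signal_number).name  # look up the symbolic name

print(signal_number, signal_name)
```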

Free the memory held by data you no longer need by adding the following line for every variable that will not be used anymore:

VariableName = None
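
A minimal sketch of why this helps, assuming CPython (which frees an object as soon as its last reference disappears) and assuming nothing else still refers to the object. Payload is a stand-in class made up for the demo, and the weak reference is only there to observe the object without keeping it alive.

```python
import weakref


class Payload:
    """Stand-in for a large object such as an array."""


big = Payload()
probe = weakref.ref(big)      # observes the object without keeping it alive

alive_before = probe() is big # the name still refers to the object
big = None                    # drop the last strong reference
freed_after = probe() is None # on CPython the object is gone immediately

print(alive_before, freed_after)
```

If another list, closure or cache still holds the object, setting one name to None does not release it.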

If you are running inside Docker, check it as well, since the configured memory limit could be too low.

As you say, 20% (that is, 3.2 GB) was already in use, so an additional 8 GB means that about 70% of your memory was in use, which is an enormous amount.
