
OOM: Out of Memory error during hyperparameter optimization with Talos on a TensorFlow model

While searching for the optimal hyperparameters for my AlexNet with the help of Talos, I get an Out of Memory error. It always happens at the same permutation (32/240), even if I change the parameters slightly (to rule out an unfavorable parameter combination as the cause).

Error message:

ResourceExhaustedError:  OOM when allocating tensor with shape[32,96,26,26] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[node max_pooling2d_1/MaxPool (defined at D:\anaconda\envs\tf_ks\lib\site-packages\keras\backend\tensorflow_backend.py:3009) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_keras_scratch_graph_246047]

Function call stack:
keras_scratch_graph

Here is my code:

Session configuration:

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth=True
config.gpu_options.per_process_gpu_memory_fraction = 0.99
sess = tf.compat.v1.Session(config = config)
K.set_session(sess)
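
As an aside, the TF 2.x-native equivalent of the allow_growth setting looks roughly like this (a sketch; it has to run before anything touches the GPU):

import tensorflow as tf

# Sketch: TF 2.x-native equivalent of allow_growth. Must run before any op is
# placed on the GPU, otherwise set_memory_growth raises a RuntimeError.
for gpu in tf.config.experimental.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)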

Configuration and fitting of the AlexNet:

def alexnet(x_train, y_train, x_val, y_val, params):
    
    K.clear_session()
    
    if params['activation'] == 'leakyrelu':
        activation_layer = LeakyReLU(alpha = params['leaky_alpha'])
    elif params['activation'] == 'relu':
        activation_layer = ReLU()
    
    model = Sequential([
        Conv2D(filters=96, kernel_size=(11,11), strides=(4,4), activation='relu', input_shape=(224,224,Global.num_image_channels)),
        BatchNormalization(),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Conv2D(filters=256, kernel_size=(5,5), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Conv2D(filters=384, kernel_size=(3,3), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        Conv2D(filters=384, kernel_size=(1,1), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        Conv2D(filters=256, kernel_size=(1,1), strides=(1,1), activation='relu', padding="same"),
        BatchNormalization(),
        MaxPooling2D(pool_size=(3,3), strides=(2,2)),
        Flatten(),
        Dense(4096, activation=activation_layer),
        Dropout(0.5),#todo
        Dense(4096, activation=activation_layer),
        Dropout(0.5),#todo
        Dense(units = 2, activation=activation_layer)
        #Dense(10, activation='softmax')
    ])
        
    model.compile(
        optimizer = params['optimizer'](lr = lr_normalizer(params['lr'], params['optimizer'])), 
        loss = Global.loss_funktion, 
        metrics = get_reduction_metric(Global.reduction_metric)
    )
    train_generator, valid_generator = create_data_pipline(params['batch_size'], params['samples'])
    tg_steps_per_epoch = train_generator.n // train_generator.batch_size
    vg_validation_steps = valid_generator.n // valid_generator.batch_size
    print('Steps per Epoch: {}, Validation Steps: {}'.format(tg_steps_per_epoch, vg_validation_steps))
    
    
    startTime = datetime.now()
    
    out = model.fit(
        x = train_generator,
        epochs = params['epochs'],
        validation_data = valid_generator,
        steps_per_epoch = tg_steps_per_epoch,
        validation_steps = vg_validation_steps,
        #callbacks = [checkpointer]
        workers = 8
    )
    print("Time taken:", datetime.now() - startTime)

    return out, model

Hyperparameter list:

hyper_parameter = {
    'samples': [20000],
    'epochs': [1],
    'batch_size': [32, 64],
    'optimizer': [Adam],
    'lr': [1, 2],
    'first_neuron': [1024, 2048, 4096],
    'dropout': [0.25, 0.50],
    'activation': ['leakyrelu', 'relu'],
    'hidden_layers': [0, 1, 2, 3, 4],
    'leaky_alpha': [0.1] # default for LeakyReLU, otherwise PReLU
}
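
For reference, the 240 in the "(32/240)" above is just the size of this grid: Talos scans every permutation, i.e. the product of the list lengths. A quick sketch to check:

from functools import reduce

# Permutations Talos will scan for the dict above:
# 2 batch sizes * 2 lrs * 3 first_neuron * 2 dropouts * 2 activations * 5 hidden_layers = 240
n_rounds = reduce(lambda acc, options: acc * len(options), hyper_parameter.values(), 1)
print(n_rounds)  # 240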

Run Talos:

dummy_x = np.empty((1, 2, 3, 224, 224))
dummy_y = np.empty((1, 2))

with tf.device('/device:GPU:0'):
    
        t = ta.Scan(
            x = dummy_x,
            y = dummy_y,
            model = alexnet,
            params = hyper_parameter,
            experiment_name = '{}'.format(Global.dataset),
            #shuffle=False,
            reduction_metric = Global.reduction_metric,
            disable_progress_bar = False,
            print_params = True,
            clear_session = 'tf',
            save_weights = False
        )
        

t.data.to_csv(Global.target_dir + Global.results, index = True)

The memory usage is always quite high; it does not rise over the epochs, but it does vary a little.
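
If it helps, here is a small sketch of how the usage could be logged per round instead of being watched by hand (the only assumption is that nvidia-smi is on the PATH):

import subprocess

def log_gpu_memory():
    # Query the current GPU memory usage via nvidia-smi and print it,
    # e.g. at the end of every Talos round.
    result = subprocess.run(
        ['nvidia-smi', '--query-gpu=memory.used,memory.total', '--format=csv,noheader'],
        capture_output=True, text=True, check=True
    )
    print(result.stdout.strip())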

Nvidia SMI Output:

[nvidia-smi screenshot]

Can someone please help me here?

==========================================================================

What I already tried:

1) Splitting up the Talos run:

This caused the same error.

hyper_parameter = {
    'samples': [20000],
    'epochs': [1],
    'batch_size': [32, 64],
    'optimizer': [Adam],
    'lr': [1, 2, 3, 5],
    'first_neuron': [9999],
    'dropout': [0.25, 0.50],
    'activation': ['leakyrelu', 'relu'],
    'hidden_layers': [9999],
    'leaky_alpha': [0.1] # default for LeakyReLU, otherwise PReLU
}

dummy_x = np.empty((1, 2, 3, 224, 224))
dummy_y = np.empty((1, 2))
first = True

for h in [0, 1, 2, 3, 4]:
    hyper_parameter['hidden_layers']=[h]
    for fn in [1024, 2048, 4096]:
        hyper_parameter['first_neuron']=[fn]

        with tf.device('/device:GPU:1'):

                t = ta.Scan(
                    x = dummy_x,
                    y = dummy_y,
                    model = alexnet,
                    params = hyper_parameter,
                    experiment_name = '{}'.format(Global.dataset),
                    #shuffle=False,
                    reduction_metric = Global.reduction_metric,
                    disable_progress_bar = False,
                    print_params = True,
                    clear_session = 'tf',
                    save_weights = False
                )
                if(first):
                    t.data.to_csv(Global.target_dir + Global.results, index = True, mode='a')
                    first = False
                else:
                    t.data.to_csv(Global.target_dir + Global.results, index = True, mode='a', header=False)

==========================================================================

2) Running the model in its own thread

Searching for the cause, I found that some people report the same issue and blame TensorFlow for not properly executing K.clear_session().
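
One suggestion that comes up in those reports is to force garbage collection right after clearing the session at the start of every round; a minimal sketch of that (whether it actually releases the GPU memory here is an assumption):

import gc
from keras import backend as K

def alexnet(x_train, y_train, x_val, y_val, params):
    # Same entry point as above; the only addition is gc.collect(), which forces
    # Python to drop the objects left over from the previous round.
    K.clear_session()
    gc.collect()
    # ... build and fit the model as before ...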

Maybe the idea is a silly one, but I tried to train the model in a separate thread.

from threading import Thread
def gen_model_thread(x_train, y_train, x_val, y_val, params):
    
    thread = Thread(target=alexnet, args=(x_train, y_train, x_val, y_val, params))
    thread.start()
    return_value = thread.join()
    return return_value
with tf.device('/device:GPU:0'):
    
        t = ta.Scan(
            x = dummy_x,
            y = dummy_y,
            model = gen_model_thread,
            params = hyper_parameter,
            experiment_name = '{}'.format(Global.dataset),
            #shuffle=False,
            reduction_metric = Global.reduction_metric,
            disable_progress_bar = False,
            print_params = True,
            clear_session = True,
            save_weights = False
        )

This caused a TypeError:

Traceback (most recent call last):
  File "D:\anaconda\envs\tf_ks\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "D:\anaconda\envs\tf_ks\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "<ipython-input-3-2942ae0a0a56>", line 5, in gen_model
    model = alexnet(params['activation'], params['leaky_alpha'])
  File "<ipython-input-2-2a405202aa5a>", line 27, in alexnet
    Dense(units = 2, activation=activation_layer)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\sequential.py", line 94, in __init__
    self.add(layer)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\sequential.py", line 162, in add
    name=layer.name + '_input')
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\input_layer.py", line 178, in Input
    input_tensor=tensor)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\engine\input_layer.py", line 87, in __init__
    name=self.name)
  File "D:\anaconda\envs\tf_ks\lib\site-packages\keras\backend\tensorflow_backend.py", line 73, in symbolic_fn_wrapper
    if _SYMBOLIC_SCOPE.value:
AttributeError: '_thread._local' object has no attribute 'value'

TypeError: cannot unpack non-iterable NoneType object

(The TypeError itself is expected: Thread.join() always returns None, so Talos has nothing to unpack, and Keras stores its symbolic scope in thread-local state that the worker thread never initializes, hence the AttributeError.) I know my last resort is to run the configurations manually, but I think I will run into the same problem while training my model later anyway.

Many thanks for taking the time to read my question, look into my problem, and correct the spelling errors in my text ^^.

I am looking forward to receiving constructive solutions from this amazing community here! (:

==========================================================================

GPU: NVIDIA RTX 2080 Ti and Titan Xp Collector's Edition (I tried both)

TensorFlow: 2.1.0

Keras: 2.3.1

Talos: 1.0

Disabling eager execution solved the problem for me: tf.compat.v1.disable_eager_execution()

https://github.com/autonomio/talos/issues/482
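
A minimal sketch of where the call goes: it has to run before any model, graph or session is created, so right at the top of the script, ahead of the session configuration from the question:

import tensorflow as tf
from keras import backend as K

# Must run before any graph, op or session is created.
tf.compat.v1.disable_eager_execution()

# The rest of the GPU setup stays as in the question.
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.compat.v1.Session(config=config)
K.set_session(sess)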
