
TF2.0 Memory Leak From Applying Keras Model to Symbolic Tensor

tl;dr: Memory usage of my implementation apparently grows with the number of samples passed through it, but there should be nothing in the network or the sample feeding that cares about how many samples have been passed so far.


When passing a large batch of high-dimensional data through a custom Keras model created with the functional API, I observe what I assume is a constant growth in GPU memory usage as the number of observed instances grows. The following is a minimal example of the process of passing the samples through the network:

import time

import gym
import tensorflow as tf
from tqdm import tqdm

sequence_length = 100
batch_size = 128

env = gym.make("ShadowHand-v1")
_, _, joint = build_shadow_brain(env, bs=batch_size)  # project-specific model builder, not shown here
optimizer: tf.keras.optimizers.Optimizer = tf.keras.optimizers.SGD()

start_time = time.time()
for t in tqdm(range(sequence_length), disable=False):
    sample_batch = (
        tf.random.normal([batch_size, 1, 200, 200, 3]),
        tf.random.normal([batch_size, 1, 48]),
        tf.random.normal([batch_size, 1, 92]),
        tf.random.normal([batch_size, 1, 7])
    )

    with tf.GradientTape() as tape:
        out, v = joint(sample_batch)
        loss = tf.reduce_mean(out - v)

    grads = tape.gradient(loss, joint.trainable_variables)
    optimizer.apply_gradients(zip(grads, joint.trainable_variables))
    joint.reset_states()

print(f"Execution Time: {time.time() - start_time}")

I am aware that this is a large sample given the batch size; however, if it were in fact too large for my GPU, I would expect an instant OOM error, and I also assume that 6GB of VRAM should suffice. The reason is that the OOM error only occurs after 33 instances, which leads me to suspect that memory usage is growing over time.
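As an aside (this is not part of my setup above, just a way to make such growth visible from the outside): TensorFlow reserves most of the GPU memory up front by default, which hides incremental allocation from tools like nvidia-smi; enabling memory growth before any GPU work changes that:

import tensorflow as tf

# Must run before the GPU is first used; afterwards TensorFlow allocates on demand,
# so the reported GPU memory reflects the actual per-step growth.
for gpu in tf.config.experimental.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)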

Below is the Keras summary of my model:

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
visual_input (InputLayer)       [(32, None, 200, 200 0                                            
__________________________________________________________________________________________________
proprioceptive_input (InputLaye [(32, None, 48)]     0                                            
__________________________________________________________________________________________________
somatosensory_input (InputLayer [(32, None, 92)]     0                                            
__________________________________________________________________________________________________
time_distributed (TimeDistribut (None, None, 64)     272032      visual_input[0][0]               
__________________________________________________________________________________________________
time_distributed_1 (TimeDistrib (None, None, 8)      848         proprioceptive_input[0][0]       
__________________________________________________________________________________________________
time_distributed_2 (TimeDistrib (None, None, 8)      3032        somatosensory_input[0][0]        
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, None, 80)     0           time_distributed[0][0]           
                                                                 time_distributed_1[0][0]         
                                                                 time_distributed_2[0][0]         
__________________________________________________________________________________________________
time_distributed_3 (TimeDistrib (None, None, 48)     3888        concatenate[0][0]                
__________________________________________________________________________________________________
time_distributed_4 (TimeDistrib (None, None, 48)     0           time_distributed_3[0][0]         
__________________________________________________________________________________________________
time_distributed_5 (TimeDistrib (None, None, 32)     1568        time_distributed_4[0][0]         
__________________________________________________________________________________________________
time_distributed_6 (TimeDistrib (None, None, 32)     0           time_distributed_5[0][0]         
__________________________________________________________________________________________________
goal_input (InputLayer)         [(32, None, 7)]      0                                            
__________________________________________________________________________________________________
concatenate_1 (Concatenate)     (32, None, 39)       0           time_distributed_6[0][0]         
                                                                 goal_input[0][0]                 
__________________________________________________________________________________________________
lstm (LSTM)                     (32, 32)             9216        concatenate_1[0][0]              
__________________________________________________________________________________________________
dense_10 (Dense)                (32, 20)             660         lstm[0][0]                       
__________________________________________________________________________________________________
dense_11 (Dense)                (32, 20)             660         lstm[0][0]                       
__________________________________________________________________________________________________
activation (Activation)         (32, 20)             0           dense_10[0][0]                   
__________________________________________________________________________________________________
activation_1 (Activation)       (32, 20)             0           dense_11[0][0]                   
__________________________________________________________________________________________________
concatenate_2 (Concatenate)     (32, 40)             0           activation[0][0]                 
                                                                 activation_1[0][0]               
__________________________________________________________________________________________________
dense_12 (Dense)                (32, 1)              33          lstm[0][0]                       
==================================================================================================
Total params: 291,937
Trainable params: 291,937
Non-trainable params: 0
__________________________________________________________________________________________________
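For illustration only, a rough functional-API reconstruction of the structure implied by this summary could look as follows. This is not the actual build_shadow_brain; the per-branch encoders and the activations are guesses and will not reproduce the exact parameter counts of the first three branches.

import tensorflow as tf

def build_sketch(batch_size=32):
    # Inputs match the shapes in the summary: (batch, time, ...)
    visual = tf.keras.Input(shape=(None, 200, 200, 3), batch_size=batch_size, name="visual_input")
    proprio = tf.keras.Input(shape=(None, 48), batch_size=batch_size, name="proprioceptive_input")
    somato = tf.keras.Input(shape=(None, 92), batch_size=batch_size, name="somatosensory_input")
    goal = tf.keras.Input(shape=(None, 7), batch_size=batch_size, name="goal_input")

    # Per-time-step encoders, applied with TimeDistributed as in the summary (internals are guesses).
    visual_encoder = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 5, strides=4, activation="relu"),
        tf.keras.layers.Conv2D(32, 3, strides=2, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(64, activation="relu"),
    ])
    v = tf.keras.layers.TimeDistributed(visual_encoder)(visual)
    p = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(8, activation="relu"))(proprio)
    s = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(8, activation="relu"))(somato)

    x = tf.keras.layers.Concatenate()([v, p, s])
    x = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(48, activation="relu"))(x)
    x = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(32, activation="relu"))(x)
    x = tf.keras.layers.Concatenate()([x, goal])

    # Recurrent core and the two heads, matching the shapes at the bottom of the summary.
    x = tf.keras.layers.LSTM(32, stateful=True)(x)
    mean = tf.keras.layers.Dense(20, activation="tanh")(x)
    stddev = tf.keras.layers.Dense(20, activation="softplus")(x)
    policy = tf.keras.layers.Concatenate()([mean, stddev])
    value = tf.keras.layers.Dense(1)(x)

    return tf.keras.Model(inputs=[visual, proprio, somato, goal], outputs=[policy, value])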

As you can see, there is an LSTM layer in this network. It would usually be stateful, but I have already turned that off because I assumed the problem might somehow lie there. In fact, I have already tried the following, without eliminating the issue:

  • Turning off statefulness
  • Entirely removing the LSTM
  • Not calculating any gradients
  • Rebuilding the model after every instance

and have now reached the end of my ideas concerning potential causes of the issue.

I have also forced the process onto the CPU and inspected regular memory usage (the OOM does not happen there because I have far more RAM than VRAM). Interestingly, the memory usage jumps up and down but has an upwards trend. For every instance, about 2GB of memory is taken, but when the memory is freed before taking the next sample, about 200MB less is released than was taken.
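For reference, a minimal way to log the process memory per iteration looks like this (a sketch; psutil is an assumption on my part and is not used in the code above):

import os
import psutil  # assumption: not part of the original code

_process = psutil.Process(os.getpid())

def log_rss(step: int) -> None:
    # Resident set size of this process; call once per loop iteration.
    print(f"step {step}: rss = {_process.memory_info().rss / 1e9:.2f} GB")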

EDIT 1: As mentioned in the comments, the issue might be that calling the model on the input adds to the computation graph. However, I cannot use joint.predict() because I need to calculate the gradients.
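To illustrate the constraint (with a toy stand-in model, not the actual network): predict() returns plain NumPy arrays and runs outside any GradientTape, so there is nothing left to differentiate, whereas calling the model directly keeps the operations on the tape:

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
x = tf.random.normal([8, 4])

y_numpy = model.predict(x)  # NumPy output, gradient information is gone

with tf.GradientTape() as tape:
    y = model(x, training=True)        # calling the model records the ops on the tape
    loss = tf.reduce_mean(y)
grads = tape.gradient(loss, model.trainable_variables)  # this works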

EDIT 2: I monitored the growth in memory a little more closely, and indeed every iteration keeps some memory reserved, as you can see here for the first 9 steps:

0: 8744054784
1: 8885506048
2: 9015111680
3: 9143611392
4: 9272619008
5: 9405591552
6: 9516531712
7: 9647988736
8: 9785032704

This was done with a batch size of 32. The size of one sample_batch is 256 * (200 * 200 * 3 + 48 + 92 + 7) * 32 = 984244224 bits (precision is float32), which more or less shows that the problem must indeed be that, when the sample is passed through the network, it is added to the graph because it is symbolic, as @MatiasValdenegro suggested. So I guess the question now boils down to "how to make a tensor non-symbolic", if that even is a thing.
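As a quick sanity check of that arithmetic (using the numbers exactly as stated above):

elements_per_step = 200 * 200 * 3 + 48 + 92 + 7  # 120147 values per time step
bits = 256 * elements_per_step * 32              # 984244224 bits, as stated
print(bits / 8 / 1e6)                            # ~123 MB, roughly the per-step growth of the reserved memory above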

Disclaimer: I know that you cannot reproduce the issue with the given code because there are missing components, but I cannot provide the full project's code here.

It took me a while, but I have now solved the issue. As I edited into the question above: the issue is that the functional API of Keras seems to add each sample to the computation graph without removing the inputs that are no longer needed after the iteration. There seems to be no easy way of explicitly removing them; however, the tf.function decorator solves the issue.

Taking my code example from above, it can be applied as follows:

sequence_length = 100
batch_size = 256

env = gym.make("ShadowHand-v1")
_, _, joint = build_shadow_brain(env, bs=batch_size)
plot_model(joint, to_file="model.png")  # plot_model is tf.keras.utils.plot_model
optimizer: tf.keras.optimizers.Optimizer = tf.keras.optimizers.SGD()

@tf.function
def _train():
    start_time = time.time()

    for _ in tqdm(range(sequence_length), disable=False):
        sample_batch = (tf.convert_to_tensor(tf.random.normal([batch_size, 4, 224, 224, 3])),
                        tf.convert_to_tensor(tf.random.normal([batch_size, 4, 48])),
                        tf.convert_to_tensor(tf.random.normal([batch_size, 4, 92])),
                        tf.convert_to_tensor(tf.random.normal([batch_size, 4, 7])))

        with tf.GradientTape() as tape:
            out, v = joint(sample_batch, training=True)
            loss = tf.reduce_mean(out - v)

        grads = tape.gradient(loss, joint.trainable_variables)
        optimizer.apply_gradients(zip(grads, joint.trainable_variables))

    print(f"Execution Time: {time.time() - start_time}")

_train()

That is, the training loop can be shipped in a function with the tf.function decorator. This means that the training is executed in graph mode, and for some reason this removes the issue, most likely because the graph is dumped after the function ends. For more on tf.function, see the TF2.0 guide on the topic.
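A variation of this (my own sketch, not part of the original solution) is to decorate only the train step rather than the whole loop; the step is traced once and then reused, which avoids the same per-call graph growth while keeping tqdm and the timing in ordinary eager Python:

@tf.function
def train_step(sample_batch):
    with tf.GradientTape() as tape:
        out, v = joint(sample_batch, training=True)
        loss = tf.reduce_mean(out - v)
    grads = tape.gradient(loss, joint.trainable_variables)
    optimizer.apply_gradients(zip(grads, joint.trainable_variables))
    return loss

for _ in tqdm(range(sequence_length), disable=False):
    sample_batch = (tf.random.normal([batch_size, 4, 224, 224, 3]),
                    tf.random.normal([batch_size, 4, 48]),
                    tf.random.normal([batch_size, 4, 92]),
                    tf.random.normal([batch_size, 4, 7]))
    train_step(sample_batch)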
