
rtx 2070s failed to allocate gpu memory from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory

tf 2.0.0-gpu, CUDA 10.0, RTX 2070 Super

Hi. I have a problem with allocating GPU memory. The initial allocation of memory is about 7 GB, like this:

Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6994 MB memory)

2020-01-11 22:19:22.983048: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-11 22:19:23.786225: I tensorflow/stream_executor/cuda/cuda_driver.cc:830] failed to allocate 2.78G (2989634304 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2020-01-11 22:19:24.159338: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0

Limit: 7333884724  InUse: 5888382720  MaxInUse: 6255411968  NumAllocs: 1264  MaxAllocSize: 2372141056

But I can only use about 5900 MB of memory, and allocating the rest always fails.

My guess was that to use the whole GPU memory of the RTX 2070S I should use two data types (float16 and float32), so I enabled a mixed-precision policy with this code:

opt = tf.keras.optimizers.Adam(1e-4)

opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)

Still, the allocation always fails.
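(For reference, the graph-rewrite call above is one way to enable mixed precision. In newer TF 2.x releases (roughly 2.1+) the same effect can be obtained through the experimental Keras mixed-precision policy; the sketch below illustrates that API and is not taken from the original post.)

import tensorflow as tf

# Build layers with float16 compute and float32 variables (requires TF 2.1+).
policy = tf.keras.mixed_precision.experimental.Policy('mixed_float16')
tf.keras.mixed_precision.experimental.set_policy(policy)

# Wrap the optimizer in a loss-scale optimizer to avoid float16 underflow.
opt = tf.keras.optimizers.Adam(1e-4)
opt = tf.keras.mixed_precision.experimental.LossScaleOptimizer(opt, loss_scale='dynamic')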

Tensorflow memory management can be frustrating.

Main takeaway: whenever you see OOM there is actually not enough memory and you either have to reduce your model size or batch size. TF would throw OOM when it tries to allocate sufficient memory, regardless of how much memory has been allocated before.
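As a minimal illustration (model, x_train and y_train are hypothetical placeholders for whatever you are training), the quickest fix is usually to cut the batch size passed to fit:

# Hypothetical example: same training call, smaller batches.
model.fit(x_train, y_train, batch_size=16, epochs=10)  # e.g. instead of batch_size=64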

At the start, TF tries to allocate a reasonably large chunk of memory, equivalent to about 90-98% of the total memory available - 5900 MB in your case. Then, when the actual data starts to take more than that, TF additionally tries to allocate a sufficient amount of memory, or a bit more - 2.78 GB. And if that does not fit, it throws OOM, as in your case. Your GPU could not fit 5.9 + 2.8 GB. The last chunk of 2.78 GB might actually be a little more than TF needs, but it would be used later anyway if you have multiple training steps, because the maximum required memory can fluctuate a bit between identical Session.run's.
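If you only want to stop TF from grabbing that big up-front chunk (this does not create more memory, it just changes when allocation happens), a minimal sketch for TF 2.0 looks like the following; the GPU index and the 4096 MB cap are illustrative values, and the calls must run before the GPU is first used:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Option 1: grow allocations on demand instead of reserving ~90-98% up front.
    tf.config.experimental.set_memory_growth(gpus[0], True)

    # Option 2 (use instead of option 1, not together): hard-cap this process
    # to a fixed amount of GPU memory, e.g. 4096 MB.
    # tf.config.experimental.set_virtual_device_configuration(
    #     gpus[0],
    #     [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=4096)])

Either way, if the model genuinely needs more memory than the card has, you will still hit OOM, and the only real fix remains a smaller model or batch size.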
