
Tensorflow Out of memory and CPU/GPU usage

I am using Tensorflow with Keras to train a neural network for object recognition (YOLO).

I wrote the model and I am trying to train it using Keras's model.fit_generator() with batches of 32 images of size 416x416x3.
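
For context, the generator looks roughly like this (load_and_resize, train_paths, train_labels and model are placeholder names, not my exact code):

import numpy as np

def data_generator(image_paths, labels, batch_size=32):
    # Yield (images, targets) batches indefinitely, as fit_generator expects
    while True:
        for i in range(0, len(image_paths), batch_size):
            batch_paths = image_paths[i:i + batch_size]
            # load_and_resize is a placeholder for code that reads one image
            # from disk and resizes it to 416x416x3
            images = np.array([load_and_resize(p) for p in batch_paths])
            targets = np.array(labels[i:i + batch_size])
            yield images, targets

model.fit_generator(
    data_generator(train_paths, train_labels, batch_size=32),
    steps_per_epoch=len(train_paths) // 32,
    epochs=10,
)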

I am using an NVIDIA GEFORCE RTX 2070 GPU with 8GB of memory (Tensorflow uses about 6.6 GB).
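
As a side note on that 6.6 GB figure: TF 1.x reserves most of the GPU memory up front by default. A sketch of switching to on-demand allocation via the standard allow_growth option (this changes only how memory is reserved, not how much is available):

import tensorflow as tf
from keras import backend as K

# allow_growth makes TF allocate GPU memory incrementally instead of
# grabbing nearly all of it at startup, so usage numbers reflect real demand
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))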

However, when I start training the model I receive messages like this:

tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape

2019-02-11 16:13:08.051289: W tensorflow/core/common_runtime/bfc_allocator.cc:267] Allocator (GPU_0_bfc) ran out of memory trying to allocate 338.00MiB.  Current allocation summary follows.
2019-02-11 16:13:08.057318: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (256):   Total Chunks: 1589, Chunks in use: 1589. 397.3KiB allocated for chunks. 397.3KiB in use in bin. 25.2KiB client-requested in use in bin.
2019-02-11 16:13:08.061222: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (512):   Total Chunks: 204, Chunks in use: 204. 102.0KiB allocated for chunks. 102.0KiB in use in bin. 100.1KiB client-requested in use in bin.
...
2019-02-11 16:13:08.142674: I tensorflow/core/common_runtime/bfc_allocator.cc:597] Bin (268435456):     Total Chunks: 11, Chunks in use: 11. 5.05GiB allocated for chunks. 5.05GiB in use in bin. 4.95GiB client-requested in use in bin.
2019-02-11 16:13:08.148149: I tensorflow/core/common_runtime/bfc_allocator.cc:613] Bin for 338.00MiB was 256.00MiB, Chunk State:
2019-02-11 16:13:08.150474: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 000000070B400000 of size 1280
2019-02-11 16:13:08.152627: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 000000070B400500 of size 256
2019-02-11 16:13:08.154790: I tensorflow/core/common_runtime/bfc_allocator.cc:632] Chunk at 000000070B400600 of size 256
....
2019-02-11 16:17:38.699526: I tensorflow/core/common_runtime/bfc_allocator.cc:645] Sum Total of in-use chunks: 6.11GiB
2019-02-11 16:17:38.701621: I tensorflow/core/common_runtime/bfc_allocator.cc:647] Stats:
Limit:                  6624727531
InUse:                  6557567488
MaxInUse:               6590199040
NumAllocs:                    3719
MaxAllocSize:           1624768512

2019-02-11 16:17:38.708981: W tensorflow/core/common_runtime/bfc_allocator.cc:271] ****************************************************************************************************
2019-02-11 16:17:38.712172: W tensorflow/core/framework/op_kernel.cc:1412] OP_REQUIRES failed at conv_ops_fused.cc:734 : Resource exhausted: OOM when allocating tensor with shape[16,256,52,52] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

I reported only a few lines of that message, but it seems clear that it is a memory usage problem.

Maybe I should use the CPU in my generator function for reading images and labels from files? In that case, how should I do it?
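
What I had in mind is something like pinning the preprocessing ops to the CPU, roughly like this TF 1.x sketch (path here is just a placeholder for one image file path):

import tensorflow as tf

with tf.device('/cpu:0'):
    raw = tf.read_file(path)
    image = tf.image.decode_jpeg(raw, channels=3)
    image = tf.image.resize_images(image, [416, 416])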

Thank you.

416x416 is quite a big size for neural networks.

The solution in this case is to reduce the batch size.
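
For example, halving the batch size roughly halves the per-step activation memory. Reusing the placeholder generator from the question:

model.fit_generator(
    data_generator(train_paths, train_labels, batch_size=16),  # 16 instead of 32
    steps_per_epoch=len(train_paths) // 16,
    epochs=10,
)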

Other solutions that you might not like are (see the combined sketch after this list):

  • reduce the model capacity (fewer units/filters in the layers)
  • reduce the image size
  • try float32 if you're using float64 (this might be very hard in Keras depending on which layers you're using)
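
A combined sketch of those three options (inputs is a placeholder tensor):

from keras import backend as K
from keras.layers import Conv2D

# fewer filters per layer, e.g. 16 instead of 32
x = Conv2D(16, (3, 3), padding='same', activation='relu')(inputs)

# smaller input: for YOLO keep dimensions divisible by 32, e.g. 320 instead of 416
INPUT_SIZE = 320

# make sure Keras computes in float32 rather than float64
K.set_floatx('float32')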

Keras/Tensorflow has some strange behavior when allocating memory. I don't know how it works, but I've seen rather big models pass and smaller models fail. These smaller models, however, had more intricate operations and branches.

An important thing:

If this problem is happening in your first conv layer, there is nothing that can be done in the rest of the model; you need to reduce the number of filters in the first layer (or the image size).
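
To see why: activation memory scales linearly with batch size, filter count, and spatial area. A quick back-of-the-envelope using the tensor shape from the log above:

import numpy as np

# [batch, filters, height, width] from the OOM message
shape = (16, 256, 52, 52)
mib = np.prod(shape) * 4 / 2**20   # float32 = 4 bytes per element
print('%.2f MiB' % mib)            # about 42 MiB for this single tensor

Many such tensors are alive at once during backprop, which is how a 6.6 GB budget fills up (the log above reports 6.11 GiB of in-use chunks).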
