CUDA_ERROR_OUT_OF_MEMORY on Tensorflow#object_detection/train.py

Question

I'm running the Tensorflow Object Detection API to train my own detector using the object_detection/train.py script. The problem is that I keep getting CUDA_ERROR_OUT_OF_MEMORY.

I found some suggestions to reduce the batch size so the trainer consumes less memory, but I reduced it from 16 to 4 and I still get the same error. The difference is that with batch_size=16 the error was thrown at step ~18, and now it is thrown at step ~70. EDIT: setting batch_size=1 didn't solve the problem, as I still got the error at step ~2700.

What can I do to make it run smoothly until I stop the training process? I don't really need fast training.

EDIT: I'm currently using a GTX 750 Ti 2GB for this. The GPU is not being used for anything other than training and driving the monitor. Currently, I'm using only 80 images for training and 20 images for evaluation.

Answer 1

I don't think this is about batch_size, because the training is able to start in the first place.

Open a terminal and run

nvidia-smi -l

to check whether another process kicks in when the error happens. If you set batch_size=16, you can find out pretty quickly.
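If reading the looping nvidia-smi table by eye is awkward, the same check can be scripted. The following is only a sketch: it assumes nvidia-smi is on the PATH and that the driver supports the --query-compute-apps query, and the helper names parse_compute_apps and gpu_processes are made up for this example. On some GeForce cards the per-process memory column is reported as unavailable, which the parser keeps as None.

```python
import subprocess
import time

# Ask nvidia-smi for one CSV row per process that holds a compute context.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse CSV rows into (pid, process_name, used_mib_or_None) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        if line.count(",") < 2:
            continue  # skip blank or malformed rows
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        try:
            pid = int(pid.strip())
        except ValueError:
            continue
        try:
            used_mib = int(used.strip())
        except ValueError:
            used_mib = None  # e.g. "[N/A]" when the driver does not report it
        rows.append((pid, name.strip(), used_mib))
    return rows


def gpu_processes():
    """Run nvidia-smi once and return the parsed process list."""
    out = subprocess.check_output(QUERY, universal_newlines=True)
    return parse_compute_apps(out)


if __name__ == "__main__":
    # Poll once per second; stop with Ctrl+C.
    while True:
        print(time.strftime("%H:%M:%S"), gpu_processes())
        time.sleep(1)
```

Leaving this running in a second terminal while training should show whether a second process (an evaluation job, for example) appears shortly before the error.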

Answer 2

I found the solution to my problem. The batch_size was not the cause, but a higher batch_size made the training memory consumption grow faster, because I was using the config.gpu_options.allow_growth = True configuration. This setting lets Tensorflow increase its memory consumption as needed, up to 100% of GPU memory.

The problem was that I was running the eval.py script at the same time (as recommended in their tutorial), and it was using part of the GPU memory. When the train.py script tried to use all 100%, the error was thrown.

I solved it by setting the maximum usage to 70% for the training process. It also fixed the stuttering I was seeing while training. This may not be the optimal value for my GPU, but it is configurable, for example with the config.gpu_options.per_process_gpu_memory_fraction = 0.7 setting.
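To make the 70% figure concrete, here is a small sketch of the arithmetic together with one way of building such a session config. The helper names memory_budget_mib and make_session_config are hypothetical, the 2048 MiB total is taken from the 2GB card mentioned in the question, and where the config object is actually created depends on your version of the training script, so treat that part as an assumption.

```python
def memory_budget_mib(total_mib, fraction):
    """Return (cap_for_this_process, remainder) in MiB for a given fraction."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    cap = total_mib * fraction
    return cap, total_mib - cap


def make_session_config(fraction):
    """Build a session config that caps this process at `fraction` of GPU memory.

    TensorFlow is imported lazily so the arithmetic above works without it.
    """
    import tensorflow as tf

    # TF 1.x exposes tf.ConfigProto; TF 2.x keeps it under tf.compat.v1.
    config_cls = getattr(tf, "ConfigProto", None) or tf.compat.v1.ConfigProto
    config = config_cls()
    config.gpu_options.per_process_gpu_memory_fraction = fraction
    return config


if __name__ == "__main__":
    cap, rest = memory_budget_mib(2048, 0.7)
    print("training cap: %.1f MiB, left for everything else: %.1f MiB" % (cap, rest))
```

On a 2GB card this leaves roughly 600 MiB for the evaluation process and the display, which is consistent with the stuttering going away once training could no longer claim the whole device.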

Answer 3

Another option is to dedicate the GPU to training and use the CPU for evaluation.

Disadvantage: Evaluation will consume a large portion of your CPU, but only for a few seconds each time a training checkpoint is created, which is not often.
Advantage: 100% of your GPU is used for training all the time.

To target the CPU, set this environment variable before you run the evaluation script:

export CUDA_VISIBLE_DEVICES=-1
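The export above affects every later command in that shell. If you would rather hide the GPU from the evaluation process only, a small wrapper can set the variable for a single child process. This is a sketch, not part of the Object Detection API: the function name cpu_only_env is made up here, and the command to run is whatever you normally use to start evaluation, passed on the command line.

```python
import os
import subprocess
import sys


def cpu_only_env(base_env=None):
    """Return a copy of the environment with all CUDA devices hidden."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = "-1"  # same value as the export above
    return env


if __name__ == "__main__":
    # Everything after the script name is the command to run on the CPU.
    command = sys.argv[1:]
    if not command:
        sys.exit("usage: python this_script.py <evaluation command...>")
    sys.exit(subprocess.call(command, env=cpu_only_env()))
```

The training process started from the same shell keeps its normal view of the GPU, because only the child's environment is modified.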

You can also explicitly set the evaluation batch size to 1 in pipeline.config to consume less memory:

eval_config: {
  metrics_set: "coco_detection_metrics"
  use_moving_averages: false
  batch_size: 1
}

If you're still having issues, TensorFlow may not be releasing GPU memory between training runs. Try restarting your terminal or IDE, then run the training again.

3 answers

Answer 1: score 1, 2017-11-29 08:34:52

Answer 2: score 0, accepted, 2017-11-29 13:24:07

Answer 3: score 0, 2021-11-17 14:20:08
