TensorFlow GPU memory allocation

Question

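I am trying to train a custom object detection model on my GPU instead of the CPU. I have followed all the instructions in this tutorial: https://tensorflow-object-detection-api-tutorial.readthedocs.io/

I have tested my setup, and everything is installed and working.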

Currently using:

Windows 10
NVIDIA Quadro P1000
TensorFlow 2.4.0
CUDA 11.0
cuDNN 8.0.4
Pretrained model = ssd_resnet50_v1_fpn_640x640_coco17_tpu-8
Number of classes to detect = 8
Batch size: 1

The problem, however, is that after training for a few seconds it stops using the GPU and prints the following warning messages.


2020-12-29 15:01:15.444931: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-29 15:01:18.923079: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2020-12-29 15:01:18.928526: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2020-12-29 15:01:19.830691: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro P1000 computeCapability: 6.1
coreClock: 1.5185GHz coreCount: 4 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 89.53GiB/s
2020-12-29 15:01:19.838069: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-29 15:01:19.849650: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-29 15:01:19.854098: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-29 15:01:19.861632: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-29 15:01:19.867525: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-29 15:01:19.879754: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-29 15:01:19.886521: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-29 15:01:19.891603: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-29 15:01:19.895368: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-29 15:01:19.900144: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-29 15:01:19.910485: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: Quadro P1000 computeCapability: 6.1
coreClock: 1.5185GHz coreCount: 4 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 89.53GiB/s
2020-12-29 15:01:19.917796: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2020-12-29 15:01:19.922273: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-29 15:01:19.926687: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-29 15:01:19.930618: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2020-12-29 15:01:19.934399: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2020-12-29 15:01:19.938808: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2020-12-29 15:01:19.943155: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2020-12-29 15:01:19.947005: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-29 15:01:19.950826: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2020-12-29 15:01:20.491701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-12-29 15:01:20.496963: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0
2020-12-29 15:01:20.500990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N
2020-12-29 15:01:20.504027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2991 MB memory) -> physical GPU (device: 0, name: Quadro P1000, pci bus id: 0000:01:00.0, compute capability: 6.1)
2020-12-29 15:01:20.512219: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
I1229 15:01:20.515150  5872 mirrored_strategy.py:350] Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0',)
INFO:tensorflow:Maybe overwriting train_steps: None
I1229 15:01:20.515150  5872 config_util.py:552] Maybe overwriting train_steps: None
INFO:tensorflow:Maybe overwriting use_bfloat16: False
I1229 15:01:20.515150  5872 config_util.py:552] Maybe overwriting use_bfloat16: False
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\model_lib_v2.py:523: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
W1229 15:01:20.530780  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\model_lib_v2.py:523: StrategyBase.experimental_distribute_datasets_from_function (from tensorflow.python.distribute.distribute_lib) is deprecated and will be removed in a future version.
Instructions for updating:
rename to distribute_datasets_from_function
INFO:tensorflow:Reading unweighted datasets: ['annotations/train.record']
I1229 15:01:20.546404  5872 dataset_builder.py:148] Reading unweighted datasets: ['annotations/train.record']
INFO:tensorflow:Reading record datasets for input file: ['annotations/train.record']
I1229 15:01:20.546404  5872 dataset_builder.py:77] Reading record datasets for input file: ['annotations/train.record']
INFO:tensorflow:Number of filenames to read: 1
I1229 15:01:20.546404  5872 dataset_builder.py:78] Number of filenames to read: 1
WARNING:tensorflow:num_readers has been reduced to 1 to match input file shards.
W1229 15:01:20.546404  5872 dataset_builder.py:86] num_readers has been reduced to 1 to match input file shards.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
W1229 15:01:20.546404  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:103: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_deterministic`.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
W1229 15:01:20.562029  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\builders\dataset_builder.py:222: DatasetV1.map_with_legacy_function (from tensorflow.python.data.ops.dataset_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map()
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
W1229 15:01:25.685788  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sparse_to_dense (from tensorflow.python.ops.sparse_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Create a `tf.sparse.SparseTensor` and use `tf.sparse.to_dense` instead.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
`seed2` arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
W1229 15:01:27.908942  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\dispatch.py:201: sample_distorted_bounding_box (from tensorflow.python.ops.image_ops_impl) is deprecated and will be removed in a future version.
Instructions for updating:
`seed2` arg is deprecated.Use sample_distorted_bounding_box_v2 instead.
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\inputs.py:281: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
W1229 15:01:29.229117  5872 deprecation.py:339] From C:\Users\USER-\Anaconda3\lib\site-packages\object_detection\inputs.py:281: to_float (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
2020-12-29 15:01:31.781125: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\keras\backend.py:434: UserWarning: `tf.keras.backend.set_learning_phase` is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.
  warnings.warn('`tf.keras.backend.set_learning_phase` is deprecated and '
2020-12-29 15:01:48.972736: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2020-12-29 15:01:49.258182: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2020-12-29 15:01:49.287771: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2020-12-29 15:01:49.822205: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0

2020-12-29 15:01:49.866004: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0

WARNING:tensorflow:Unresolved object in checkpoint: (root).model._groundtruth_lists
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._groundtruth_lists
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._batched_prediction_tensor_names
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._batched_prediction_tensor_names
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._sorted_head_names
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._sorted_head_names
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers
W1229 15:01:52.823682  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._head_scope_conv_layers
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._head_scope_conv_layers
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head._box_encoder_layers
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._box_prediction_head._box_encoder_layers
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads.class_predictions_with_background
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._prediction_heads.class_predictions_with_background
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.0
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.0
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.1
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.1
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.2
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.2
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.3
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.3
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.4
W1229 15:01:52.839355  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._additional_projection_layers.4
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.7.moving_variance
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.axis
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.axis
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.gamma
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.gamma
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.beta
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.beta
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_mean
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_mean
WARNING:tensorflow:Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_variance
W1229 15:01:53.076874  5872 util.py:161] Unresolved object in checkpoint: (root).model._box_predictor._base_tower_layers_for_heads.class_predictions_with_background.4.10.moving_variance
WARNING:tensorflow:A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
W1229 15:01:53.076874  5872 util.py:169] A checkpoint was restored (e.g. tf.train.Checkpoint.restore or tf.keras.Model.load_weights) but not all checkpointed values were used. See above for specific issues. Use expect_partial() on the load status object, e.g. tf.train.Checkpoint.restore(...).expect_partial(), to silence these warnings, or use assert_consumed() to make the check explicit. See https://www.tensorflow.org/guide/checkpoint#loading_mechanics for details.
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.468799  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.484427  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.484427  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
I1229 15:01:53.484427  5872 cross_device_ops.py:565] Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
WARNING:tensorflow:From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py:605: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
W1229 15:01:59.423827 15152 deprecation.py:537] From C:\Users\USER-\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py:605: calling map_fn_v2 (from tensorflow.python.ops.map_fn) with dtype is deprecated and will be removed in a future version.
Instructions for updating:
Use fn_output_signature instead
2020-12-29 15:02:11.320699: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.73GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:11.351326: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.74GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:11.751709: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.08GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:11.784850: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.09GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:12.607912: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.14GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:12.644507: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.15GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.057969: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.05GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.092341: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.299573: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.04GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2020-12-29 15:02:13.331704: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.06GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.

Also, I am not running any other programs on my machine, so it seems a bit odd that it runs out of memory.
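For context, the log already states how much memory TensorFlow claimed and how large the failed requests were. The following is a minimal sketch for pulling those two numbers out of a saved log. The function name summarize_gpu_memory is hypothetical, and the excerpt is three lines trimmed from the output above.

```python
import re

# Three lines trimmed from the log in the question; in practice, pass the full log text.
LOG_EXCERPT = """\
2020-12-29 15:01:20.504027: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2991 MB memory)
2020-12-29 15:02:11.320699: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.73GiB with freed_by_count=0.
2020-12-29 15:02:12.644507: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.15GiB with freed_by_count=0.
"""


def summarize_gpu_memory(log_text):
    """Return (memory claimed by TensorFlow in MB, largest failed request in GiB)."""
    claimed = re.search(r"with (\d+) MB memory", log_text)
    requests = re.findall(r"trying to allocate ([\d.]+)GiB", log_text)
    claimed_mb = int(claimed.group(1)) if claimed else None
    largest_gib = max(float(r) for r in requests) if requests else None
    return claimed_mb, largest_gib


claimed_mb, largest_gib = summarize_gpu_memory(LOG_EXCERPT)
print(f"claimed by TensorFlow: {claimed_mb} MB, largest failed request: {largest_gib} GiB")
```

Each failed request is a large fraction of the memory TensorFlow was able to claim on the 4 GiB card, so these warnings can appear even when no other program is using the GPU.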

Answer 1

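I have seen this error message for four different reasons, each with a different solution:

1. You are out of memory

Your GPU memory may be full. This issue arises when TensorFlow initializes and your computational graph ends up using all the memory of the physical device. The solution is to set allow growth = True in the GPU options. When memory growth is enabled for a GPU, runtime initialization does not allocate all of the memory on the device. Adding the following snippet right after your imports may solve the problem.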

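import tensorflow as tf

# List the GPUs visible to TensorFlow, then enable on-demand memory growth
# for the first one. This must run before the GPU is initialized.
physical_devices = tf.config.experimental.list_physical_devices('GPU')
if len(physical_devices) > 0:
    tf.config.experimental.set_memory_growth(physical_devices[0], True)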
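If you would rather not edit the training code, the same behavior can usually be requested through the TF_FORCE_GPU_ALLOW_GROWTH environment variable, which the TensorFlow GPU guide describes as an alternative to set_memory_growth. This is a minimal sketch, assuming you can add a few lines at the very top of the entry script. The helper name request_memory_growth is hypothetical.

```python
import os


def request_memory_growth():
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it all up front."""
    # TensorFlow reads this variable when it initializes the GPU,
    # so it has to be set before that happens.
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"


request_memory_growth()

# import tensorflow as tf  # in the real script, import TensorFlow only after this point
```

Setting the variable in the shell before launching the training script should have the same effect.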

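2. You have a cache issue

I regularly work around this error by shutting down my Python processes, deleting the ~/.nv directory (on Linux, rm -rf ~/.nv), and restarting the Python processes. I don't know exactly why this works. It is probably at least partly related to the memory problem described above.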
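The cleanup step can also be scripted. This is a minimal sketch, assuming all Python processes that use the GPU have already been shut down. The function name clear_nv_cache is hypothetical, and the directory is a parameter because ~/.nv is the Linux location and the cache may live elsewhere on other platforms.

```python
import shutil
from pathlib import Path


def clear_nv_cache(cache_dir=None):
    """Delete the NVIDIA cache directory. Return True if something was removed."""
    # Default to ~/.nv, the Linux location mentioned above.
    cache_dir = Path(cache_dir) if cache_dir is not None else Path.home() / ".nv"
    if not cache_dir.is_dir():
        return False  # nothing to clean
    shutil.rmtree(cache_dir)
    return True
```

Call clear_nv_cache() once, then restart the Python processes as described above.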

3. Keras layers (classes) were imported directly from keras instead of tensorflow.keras

Keras is included in TensorFlow 2.0 and above, so remove import keras and replace every statement of the form from keras.module.module import class with from tensorflow.keras.module.module import class.

For example, replace

from keras.layers import Conv3D, ConvLSTM2D, Conv3DTranspose, Input

with:

from tensorflow.keras.layers import Conv3D, ConvLSTM2D, Conv3DTranspose, Input
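For a project with many files, the replacement can be done mechanically. This is a minimal sketch that rewrites only from-imports. The function name use_tf_keras is hypothetical, and bare import keras lines are left for you to remove by hand.

```python
import re

# Matches "from keras." or "from keras " at the start of a line.
# Lines that already say "from tensorflow.keras" are not matched.
_FROM_KERAS = re.compile(r"^([ \t]*)from\s+keras(\.|\s)", re.MULTILINE)


def use_tf_keras(source):
    """Rewrite 'from keras...' imports in source code to 'from tensorflow.keras...'."""
    return _FROM_KERAS.sub(r"\1from tensorflow.keras\2", source)


before = "from keras.layers import Conv3D, ConvLSTM2D, Conv3DTranspose, Input\n"
print(use_tf_keras(before))
```

Review the result before overwriting any file, since a regular expression cannot tell real imports apart from text inside strings or comments.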

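4. You have incompatible versions of CUDA, TensorFlow, NVIDIA drivers, etc.

If you have never had similar models working, you are not running out of VRAM, your imports are correct as described in step 3, and your cache is clean, I would go back and set up CUDA + TensorFlow using the best available installation guide. I have had the most success following the instructions at https://www.tensorflow.org/install/gpu rather than those on the NVIDIA / CUDA site. Lambda Stack ( https://lambdalabs.com/lambda-stack-deep-learning-software ) is also a good way to go.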
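A quick sanity check on versions can be sketched as a table lookup. The table below is an assumption based on the tested build configurations published with the TensorFlow install guide. Verify it against that page before relying on it. The names TESTED_BUILDS, minor and matches_tested_build are hypothetical, and the sketch does not cover the NVIDIA driver version.

```python
# (CUDA, cuDNN) per TensorFlow minor release. Assumed values: check them
# against the tested-build table in the TensorFlow install guide.
TESTED_BUILDS = {
    "2.3": ("10.1", "7.6"),
    "2.4": ("11.0", "8.0"),
    "2.5": ("11.2", "8.1"),
}


def minor(version):
    """Keep only major.minor, e.g. '8.0.4' becomes '8.0'."""
    return ".".join(version.split(".")[:2])


def matches_tested_build(tf_version, cuda_version, cudnn_version):
    """Return True/False for a known release, or None if the release is not in the table."""
    expected = TESTED_BUILDS.get(minor(tf_version))
    if expected is None:
        return None
    return (minor(cuda_version), minor(cudnn_version)) == expected


# Versions listed in the question.
print(matches_tested_build("2.4.0", "11.0", "8.0.4"))
```

With the versions listed in the question the check passes, which would suggest ruling out the other causes first.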

Answer 1
0 2020-12-29 15:54:57
