ResourceExhaustedError: OOM on GPU (128 GB RAM) while training a deep learning model with TensorFlow, please help

Question

C:\Users\CVL-Acoustics\Documents\bangla-sentence-correction-master>python train.py
Sit back and relax, it will take some time to train the model...
Vocabulary size 250000
WARNING:tensorflow:From C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\ops\rnn.py:417: calling reverse_sequence (from tensorflow.python.ops.array_ops) with seq_dim is deprecated and will be removed in a future version.
Instructions for updating:
seq_dim is deprecated, use seq_axis instead
WARNING:tensorflow:From C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py:432: calling reverse_sequence (from tensorflow.python.ops.array_ops) with batch_dim is deprecated and will be removed in a future version.
Instructions for updating:
batch_dim is deprecated, use batch_axis instead
WARNING:tensorflow:From train.py:228: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

epoch 1 training
Traceback (most recent call last):
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
    return fn(*args)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[6656,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable_1/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: rnn/while/cond/Add/_87 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_421_rnn/while/cond/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_clooprnn/while/cond/ArgMax/dimension/_1)]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 321, in <module>
    _, l = sess.run([train_op, loss], fd)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
    run_metadata_ptr)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
    run_metadata)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[6656,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable_1/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: rnn/while/cond/Add/_87 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_421_rnn/while/cond/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_clooprnn/while/cond/ArgMax/dimension/_1)]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'MatMul', defined at:
  File "train.py", line 218, in <module>
    decoder_logits_flat = tf.add(tf.matmul(decoder_outputs_flat, W), b)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2014, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 4278, in mat_mul
    name=name)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3414, in create_op
    op_def=op_def)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[6656,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable_1/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: rnn/while/cond/Add/_87 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_421_rnn/while/cond/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_clooprnn/while/cond/ArgMax/dimension/_1)]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
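
The shape in this error accounts for most of the problem. The tensor that fails to allocate is the output of the MatMul defined at train.py line 218, `tf.matmul(decoder_outputs_flat, W)`: one row per flattened decoder position (6656 rows, presumably batch size times decoder length after the Reshape) and one column per vocabulary entry (250000). The arithmetic below assumes 4-byte float32 elements, which matches `type float` in the message; `tensor_bytes` is only an illustrative helper, not part of the repository. Note also that the 128 GB in the title is presumably host RAM, whereas this tensor has to fit in the GPU's own memory.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor with the given shape.

    The default of 4 bytes per element assumes float32.
    """
    n = 1
    for dim in shape:
        n *= dim
    return n * bytes_per_element


# Shape reported in the ResourceExhaustedError for the MatMul node.
logits_shape = (6656, 250000)
size = tensor_bytes(logits_shape)
print(f"{size:,} bytes = {size / 2**30:.2f} GiB")  # prints: 6,656,000,000 bytes = 6.20 GiB
```

That is a single buffer. Training also needs the gradient of the logits, which has the same shape (softmax_cross_entropy_with_logits computes it alongside the loss), so the demand is at least twice that figure before the weights and RNN activations are counted.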

Answer 1

There are several reasons why this can happen. Things to try:

Try reducing the number of parameters in the network.
Try decreasing your batch size.
Check whether another kernel or process is currently active and holding GPU memory.
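
The first two suggestions can be made concrete. In this model the 250000-word vocabulary sets the width of the output projection `W`, so the batch size and the vocabulary size both scale the logits tensor linearly. The sketch below is plain Python with hypothetical helper names (`max_batch_size`, `truncate_vocab`, `encode`) and assumed numbers (an 8 GiB memory budget, 52 decoder steps); it is not taken from the repository's train.py. It estimates how large a batch fits for a given vocabulary and shows one common way to shrink the vocabulary: keep the most frequent words and map the rest to an `<UNK>` id.

```python
from collections import Counter

BYTES_PER_FLOAT32 = 4


def max_batch_size(budget_bytes, decoder_steps, vocab_size, copies=2):
    """Largest batch size whose float32 logits fit in budget_bytes.

    copies=2 counts only the logits and their gradient, so the result is an
    upper bound: weights, RNN activations and optimizer state need memory too.
    """
    per_example = decoder_steps * vocab_size * BYTES_PER_FLOAT32 * copies
    return budget_bytes // per_example


def truncate_vocab(token_counts, keep, unk="<UNK>"):
    """Give ids to the `keep` most frequent tokens; id 0 is reserved for unk."""
    vocab = {unk: 0}
    for token, _ in token_counts.most_common(keep):
        if token != unk:
            vocab[token] = len(vocab)
    return vocab


def encode(tokens, vocab, unk="<UNK>"):
    """Convert tokens to ids, sending out-of-vocabulary tokens to the unk id."""
    return [vocab.get(token, vocab[unk]) for token in tokens]


if __name__ == "__main__":
    budget = 8 * 2**30  # hypothetical: 8 GiB of free GPU memory
    steps = 52          # hypothetical decoder length (6656 = 128 * 52 is one possibility)
    print(max_batch_size(budget, steps, 250_000))  # 82
    print(max_batch_size(budget, steps, 50_000))   # 412

    counts = Counter("a a a b b c".split())
    vocab = truncate_vocab(counts, keep=2)
    print(vocab)                           # {'<UNK>': 0, 'a': 1, 'b': 2}
    print(encode(["a", "c", "b"], vocab))  # [1, 0, 2]
```

Shrinking the vocabulary also shrinks `W`, `b` and their optimizer state, so it addresses the "reduce the parameters" suggestion directly. Whatever cut-off is chosen has to be applied the same way when the training data is encoded and at inference time.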

Answer 1, score 0, 2019-08-05 05:56:46
