
While training a deep learning model using the TensorFlow library I am getting an error: ResourceExhaustedError, OOM on GPU (128 GB RAM). Kindly help me.

C:\Users\CVL-Acoustics\Documents\bangla-sentence-correction-master>python train.py
Sit back and relax, it will take some time to train the model...
Vocabulary size 250000
WARNING:tensorflow:From C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\ops\rnn.py:417: calling reverse_sequence (from tensorflow.python.ops.array_ops) with seq_dim is deprecated and will be removed in a future version.
Instructions for updating:
seq_dim is deprecated, use seq_axis instead
WARNING:tensorflow:From C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\util\deprecation.py:432: calling reverse_sequence (from tensorflow.python.ops.array_ops) with batch_dim is deprecated and will be removed in a future version.
Instructions for updating:
batch_dim is deprecated, use batch_axis instead
WARNING:tensorflow:From train.py:228: softmax_cross_entropy_with_logits (from tensorflow.python.ops.nn_ops) is deprecated and will be removed in a future version.
Instructions for updating:

Future major versions of TensorFlow will allow gradients to flow into the labels input on backprop by default.

See @{tf.nn.softmax_cross_entropy_with_logits_v2}.

epoch 1 training
Traceback (most recent call last):
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1322, in _do_call
    return fn(*args)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[6656,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable_1/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: rnn/while/cond/Add/_87 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_421_rnn/while/cond/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_clooprnn/while/cond/ArgMax/dimension/_1)]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "train.py", line 321, in <module>
    _, l = sess.run([train_op, loss], fd)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 900, in run
    run_metadata_ptr)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1316, in _do_run
    run_metadata)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\client\session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[6656,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable_1/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: rnn/while/cond/Add/_87 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_421_rnn/while/cond/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_clooprnn/while/cond/ArgMax/dimension/_1)]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

Caused by op 'MatMul', defined at:
  File "train.py", line 218, in <module>
    decoder_logits_flat = tf.add(tf.matmul(decoder_outputs_flat, W), b)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\ops\math_ops.py", line 2014, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\ops\gen_math_ops.py", line 4278, in mat_mul
    name=name)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\framework\op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 3414, in create_op
    op_def=op_def)
  File "C:\Users\CVL-Acoustics\Anaconda3\lib\site-packages\tensorflow\python\framework\ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[6656,250000] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
     [[Node: MatMul = MatMul[T=DT_FLOAT, transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](Reshape, Variable_1/read)]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.

     [[Node: rnn/while/cond/Add/_87 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_421_rnn/while/cond/Add", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"](^_clooprnn/while/cond/ArgMax/dimension/_1)]]

Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
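
The hint repeated in the log can be followed roughly as below. This is only a sketch: the helper name `run_with_oom_report` is hypothetical, and it assumes a TF 1.x-style session like the `sess` in the traceback. It passes a `RunOptions` proto with `report_tensor_allocations_upon_oom=True` to `sess.run`, so that an OOM failure also lists the tensors that were allocated at that moment.

```python
def run_with_oom_report(sess, fetches, feed_dict):
    """Run one step and ask TensorFlow to list live tensors if it hits OOM.

    `sess` is assumed to be a TF 1.x-style Session (tf.Session or
    tf.compat.v1.Session); this helper is illustrative, not part of train.py.
    """
    import tensorflow as tf  # imported lazily so the helper can be defined without TF

    # tf.compat.v1 exists in newer releases; older 1.x releases expose
    # RunOptions directly on the top-level tf module.
    v1 = getattr(getattr(tf, "compat", None), "v1", tf)
    run_options = v1.RunOptions(report_tensor_allocations_upon_oom=True)
    return sess.run(fetches, feed_dict=feed_dict, options=run_options)


# Possible use in place of the failing call at train.py line 321:
# _, l = run_with_oom_report(sess, [train_op, loss], fd)
```

The extra report does not free any memory; it only shows which tensors occupy the GPU when the allocation fails.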

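There are several reasons why this can happen. The allocation that fails is the [6656, 250000] float output of the final MatMul, so the 250000-word vocabulary and the number of rows fed to that layer are the likely culprits.

  • Try reducing the parameters of the network, for example the vocabulary size of the output layer.
  • Try decreasing your batch size, which shrinks the first dimension of that tensor.
  • Check whether another kernel or process is currently active and already holding the GPU memory.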
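
To get a feel for the numbers, here is a rough back-of-the-envelope sketch. The shape comes from the error message. The 8 GiB budget and the `copies=3` multiplier (extra same-shaped buffers such as softmax outputs and gradients) are assumptions chosen for illustration, not measurements from this setup.

```python
def tensor_bytes(shape, bytes_per_element=4):
    """Bytes needed for one dense tensor; 4 bytes per element matches float32."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_element


def max_rows(budget_bytes, vocab_size, copies=3, bytes_per_element=4):
    """Largest row count (batch size * time steps) whose logits fit the budget.

    `copies` is an assumed multiplier for extra same-shaped buffers;
    it is a guess, not something read from the log.
    """
    return budget_bytes // (vocab_size * bytes_per_element * copies)


GIB = 2 ** 30

logits_bytes = tensor_bytes((6656, 250000))      # shape from the OOM message
print(f"logits: {logits_bytes / GIB:.2f} GiB")   # logits: 6.20 GiB

# Hypothetical 8 GiB budget, for illustration only.
print(max_rows(8 * GIB, 250000))   # 2863 rows with the full vocabulary
print(max_rows(8 * GIB, 50000))    # 14316 rows with a 50000-word vocabulary
```

A single float32 tensor of that shape already needs about 6.2 GiB, which is why shrinking either the batch (rows) or the vocabulary (columns) has such a direct effect.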


