
cuDNN launch failure (tensorflow-gpu/CUDA)

Traceback (most recent call last):
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1322, in _do_call
    return fn(*args)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1307, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1409, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape ([202027,64,1,1])
     [[Node: bn_fm_1/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NCHW", epsilon=0.001, is_training=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bn_fm_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, bn_fm/gamma/read, bn_fm/beta/read, bn_fm/moving_mean/read, bn_fm/moving_variance/read)]]
     [[Node: AddN/_31 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_202_AddN", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "NeuralFM.py", line 350, in <module>
    model.train(data.Train_data, data.Validation_data, data.Test_data)
  File "NeuralFM.py", line 266, in train
    init_train = self.evaluate(Train_data)
  File "NeuralFM.py", line 311, in evaluate
    predictions = self.sess.run((self.out), feed_dict=feed_dict)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 900, in run
    run_metadata_ptr)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1135, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1316, in _do_run
    run_metadata)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/client/session.py", line 1335, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: cuDNN launch failure : input shape ([202027,64,1,1])
     [[Node: bn_fm_1/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NCHW", epsilon=0.001, is_training=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bn_fm_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, bn_fm/gamma/read, bn_fm/beta/read, bn_fm/moving_mean/read, bn_fm/moving_variance/read)]]
     [[Node: AddN/_31 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_202_AddN", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

Caused by op 'bn_fm_1/FusedBatchNorm', defined at:
  File "NeuralFM.py", line 349, in <module>
    model = NeuralFM(data.features_M, args.hidden_factor, eval(args.layers), args.loss_type, args.pretrain, args.epoch, args.batch_size, args.lr, args.lamda, eval(args.keep_prob), args.optimizer, args.batch_norm, activation_function, args.verbose, args.early_stop)
  File "NeuralFM.py", line 89, in __init__
    self._init_graph()
  File "NeuralFM.py", line 123, in _init_graph
    self.FM = self.batch_norm_layer(self.FM, train_phase=self.train_phase, scope_bn='bn_fm')
  File "NeuralFM.py", line 224, in batch_norm_layer
    is_training=False, reuse=True, trainable=True, scope=scope_bn)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.py", line 183, in func_with_args
    return func(*args, **current_args)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 596, in batch_norm
    scope=scope)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 382, in _fused_batch_norm
    is_training, _fused_batch_norm_training, _fused_batch_norm_inference)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/utils.py", line 214, in smart_cond
    return static_cond(pred_value, fn1, fn2)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/utils.py", line 194, in static_cond
    return fn2()
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/contrib/layers/python/layers/layers.py", line 379, in _fused_batch_norm_inference
    data_format=data_format)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 906, in fused_batch_norm
    name=name)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 3465, in _fused_batch_norm
    is_training=is_training, name=name)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 3414, in create_op
    op_def=op_def)
  File "/home/alex/anaconda3/envs/tensorflow/lib/python3.6/site-packages/tensorflow/python/framework/ops.py", line 1740, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

InternalError (see above for traceback): cuDNN launch failure : input shape ([202027,64,1,1])
     [[Node: bn_fm_1/FusedBatchNorm = FusedBatchNorm[T=DT_FLOAT, data_format="NCHW", epsilon=0.001, is_training=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](bn_fm_1/FusedBatchNorm-0-TransposeNHWCToNCHW-LayoutOptimizer, bn_fm/gamma/read, bn_fm/beta/read, bn_fm/moving_mean/read, bn_fm/moving_variance/read)]]
     [[Node: AddN/_31 = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_202_AddN", tensor_type=DT_FLOAT, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

I keep getting this error. I've tried everything, including downgrading CUDA, cuDNN, and tensorflow-gpu.

I'm currently on CUDA 9.0, cuDNN v7.4.2 for CUDA 9.0, and tensorflow-gpu 1.9, and nothing I do seems to help. I'm running out of ideas; I've installed every dependency I can think of.

I'm trying to run this: https://github.com/hexiangnan/neural_factorization_machine

EDIT: I have a feeling this is related to https://github.com/tensorflow/tensorflow/issues/8090, but as I'm a little new to all this, I'm not sure whether I'm right or how to address it.

I ran into the same error. In my case, the cause was that my GPU did not have enough memory for the process.
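The traceback shows `evaluate(Train_data)` pushing the whole training set through a single `sess.run` call (the FusedBatchNorm input shape is `[202027,64,1,1]`, i.e. 202027 rows at once). If GPU memory is the cause, one common mitigation is to evaluate in smaller chunks and concatenate the predictions. Below is a minimal, framework-agnostic sketch; `batched_predict`, `predict_fn`, and the default `batch_size` are hypothetical names chosen for illustration and are not part of NeuralFM.py. In practice `predict_fn` would be a wrapper you write around something like `self.sess.run(self.out, feed_dict=...)`, with the feed dict built from each chunk.

```python
from typing import Callable, List, Sequence


def batched_predict(predict_fn: Callable[[Sequence], List[float]],
                    inputs: Sequence,
                    batch_size: int = 4096) -> List[float]:
    """Call predict_fn on consecutive slices of `inputs` and concatenate the results.

    Hypothetical helper: each call only sees `batch_size` rows, so the peak
    memory needed by a single forward pass is bounded by the chunk size
    rather than by the size of the whole dataset.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    outputs: List[float] = []
    for start in range(0, len(inputs), batch_size):
        chunk = inputs[start:start + batch_size]  # last chunk may be shorter
        outputs.extend(predict_fn(chunk))
    return outputs


if __name__ == "__main__":
    # Toy stand-in for a model: doubles each input and records chunk sizes.
    seen_sizes = []

    def fake_predict(chunk):
        seen_sizes.append(len(chunk))
        return [2.0 * x for x in chunk]

    preds = batched_predict(fake_predict, list(range(10)), batch_size=4)
    print(preds, seen_sizes)
```

The chunk size is a trade-off: smaller chunks need less GPU memory per call but add per-call overhead, so a value that fits comfortably in memory is usually enough.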

I'm probably a few years too late to be of any help, Alex, but I've run into this issue on Windows with a specific GPU. Don't ask me why, but adding

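import os
# Set this before TensorFlow initializes CUDA (i.e. before `import tensorflow`).
# CUDA_VISIBLE_DEVICES normally takes GPU indices such as '0'; '/gpu:0' is a TensorFlow device name.
os.environ['CUDA_VISIBLE_DEVICES'] = '/gpu:0'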

works for me, assuming you have a single GPU.
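`CUDA_VISIBLE_DEVICES` is read by the CUDA runtime, not by TensorFlow, and it expects a comma-separated list of integer device indices (for example `'0'` or `'0,1'`). According to the CUDA documentation, an invalid entry hides that device and every device listed after it, so a value like `'/gpu:0'` plausibly hides the only GPU; if so, the workaround above may succeed simply because TensorFlow falls back to the CPU. The sketch below shows the conventional values; `set_visible_gpus` is a hypothetical helper name, and it must run in a fresh process before TensorFlow is imported.

```python
import os


def set_visible_gpus(indices):
    """Expose only the given GPU indices to CUDA (hypothetical helper).

    Call this before `import tensorflow`; once CUDA has been initialized in
    the process, changing the variable has no effect. An empty list produces
    an empty string, which hides every GPU so TensorFlow runs on the CPU.
    """
    value = ",".join(str(int(i)) for i in indices)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


if __name__ == "__main__":
    print(set_visible_gpus([0]))        # expose only the first GPU
    print(repr(set_visible_gpus([])))   # hide all GPUs (CPU only)
```

Choosing `[0]` keeps the GPU in use, while `[]` is a deliberate way to confirm whether the error disappears on the CPU.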


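Statement: the technical posts on this site are licensed under CC BY-SA 4.0. If you repost them, please credit this site's URL or the original address. For any questions, contact yoyou2525@163.com.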

Related questions
Tensorflow-GPU not using GPU with CUDA, CUDNN
tensorflow-gpu 2.3 install with cuda and cudnn can't detect GPU?
TF 2.0 GPU configure, how to configure the tensorflow-gpu 2.0 and cuda, cudnn after create another virtual environment in pycharm
tensorflow-gpu 2.2 works with CUDA 10.2 but requires cuDNN 7.6.4 which doesn't have a download file in NVIDIA archive for CUDA 10.2
Problem with Tensorflow-Gpu and Cuda drivers in Anaconda
When use tensorflow-gpu cudnn fails during the link time
Cannot use GPU on the tensorflow-gpu: "Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR"
Error after installing pip tensorflow-gpu with cuda 10
tensorflow-gpu installation problem on system with multiple cuda versions
My GPUs are not visible with tensorflow-gpu 2.1.0 and CUDA 10.1
 
Guangdong ICP filing No. 18138465  © 2020-2024 STACKOOM.COM