
TensorFlow 2.1: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

I am using Anaconda Python 3.7 and TensorFlow 2.1 with CUDA 10.1 and cuDNN 7.6.5, and I am trying to train RetinaNet with keras-retinanet ( https://github.com/fizyr/keras-retinanet ):

python keras_retinanet/bin/train.py --freeze-backbone --random-transform --batch-size 8 --steps 500 --epochs 10 csv annotations.csv classes.csv

Here are the resulting errors:

Epoch 1/10
2020-02-10 20:34:37.807590: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library cudnn64_7.dll
2020-02-10 20:34:38.835777: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-02-10 20:34:39.753051: E tensorflow/stream_executor/cuda/cuda_dnn.cc:329] Could not create cudnn handle: CUDNN_STATUS_INTERNAL_ERROR
2020-02-10 20:34:39.776706: W tensorflow/core/common_runtime/base_collective_executor.cc:217] BaseCollectiveExecutor::StartAbort Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[{{node conv1/convolution}}]]
Traceback (most recent call last):
  File "keras_retinanet/bin/train.py", line 530, in <module>
    main()
  File "keras_retinanet/bin/train.py", line 525, in main
    initial_epoch=args.initial_epoch
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\keras\legacy\interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\keras\engine\training.py", line 1732, in fit_generator
    initial_epoch=initial_epoch)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\keras\engine\training_generator.py", line 220, in fit_generator
    reset_metrics=False)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\keras\engine\training.py", line 1514, in train_on_batch
    outputs = self.train_function(ins)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\keras\backend.py", line 3727, in __call__
    outputs = self._graph_fn(*converted_inputs)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\eager\function.py", line 1551, in __call__
    return self._call_impl(args, kwargs)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\eager\function.py", line 1591, in _call_impl
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\eager\function.py", line 1692, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\eager\function.py", line 545, in call
    ctx=ctx)
  File "C:\Anaconda\Anaconda3.7\lib\site-packages\tensorflow_core\python\eager\execute.py", line 67, in quick_execute
    six.raise_from(core._status_to_exception(e.code, message), None)
  File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.UnknownError:  Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
         [[node conv1/convolution (defined at C:\Anaconda\Anaconda3.7\lib\site-packages\keras\backend\tensorflow_backend.py:3009) ]] [Op:__inference_keras_scratch_graph_12376]

Function call stack:
keras_scratch_graph

Has anyone experienced a similar problem?

I was getting the same error when trying to train my CNN model on two GPUs using tf.distribute.MirroredStrategy(). Training on a single GPU worked just fine; the workaround below lets me use both GPUs for now. Try putting the following at the beginning of your application:

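import tensorflow as tf

# Let TensorFlow grow GPU memory on demand instead of reserving it all up front
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.InteractiveSession(config=config)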

Hope that helps!
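
As far as I know, the same on-demand allocation can also be requested without a session config, through the TF_FORCE_GPU_ALLOW_GROWTH environment variable described in TensorFlow's GPU guide. Below is a minimal sketch, assuming the variable is set before TensorFlow initializes the GPU. The helper name request_gpu_memory_growth is made up for illustration and is not part of TensorFlow.

```python
import os


def request_gpu_memory_growth() -> str:
    """Ask TensorFlow to grow GPU memory on demand instead of reserving it all.

    Hypothetical helper: it only sets an environment variable, so it has to
    run before TensorFlow initializes the GPU (safest: before importing it).
    """
    os.environ["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return os.environ["TF_FORCE_GPU_ALLOW_GROWTH"]


request_gpu_memory_growth()
# import tensorflow as tf   # import TensorFlow only after the variable is set
```

Setting the variable in the shell before launching train.py should have the same effect and avoids editing the training script at all.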

Enable memory growth on the first GPU:

import tensorflow as tf

physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)
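
The snippet above configures only physical_devices[0]. On a multi-GPU machine, such as the two-GPU MirroredStrategy setup mentioned earlier, it is probably worth applying the setting to every GPU. Here is a sketch under that assumption. The helper name enable_memory_growth_on_all_gpus is hypothetical, and it takes the tensorflow module as an argument only so that it can be defined and tried out on a machine without TensorFlow.

```python
def enable_memory_growth_on_all_gpus(tf_module):
    """Turn on memory growth for every visible GPU; return how many were set.

    `tf_module` is the imported `tensorflow` module, passed in explicitly so
    the helper can be defined on machines where TensorFlow is not installed.
    """
    gpus = tf_module.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before any GPU has been initialized; otherwise TensorFlow
        # raises a RuntimeError.
        tf_module.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


try:
    import tensorflow as tf
except ImportError:  # TensorFlow is missing, so there is nothing to configure
    tf = None

if tf is not None:
    print(enable_memory_growth_on_all_gpus(tf), "GPU(s) set to memory growth")
```

TensorFlow expects the same memory-growth setting on all GPUs, which is another reason to loop over the whole list instead of touching a single device.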

According to a comment on a TensorFlow GitHub issue, this error can occur when the GPU's memory limit is hit (you can check GPU usage with the nvidia-smi or gpustat commands).
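
To check whether memory really is the bottleneck, the nvidia-smi query can be scripted. This is a rough sketch, assuming nvidia-smi is on the PATH and accepts its --query-gpu / --format options. The function names parse_gpu_memory and query_gpu_memory are invented for this example.

```python
import shutil
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs, one GPU per line, into a list of tuples."""
    usage = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        usage.append((used, total))
    return usage


def query_gpu_memory():
    """Return [(used_mib, total_mib), ...], or [] if nvidia-smi is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_memory(result.stdout)


for index, (used, total) in enumerate(query_gpu_memory()):
    print(f"GPU {index}: {used} / {total} MiB in use")
```

If a GPU is already close to its total before training starts, another process (a notebook kernel, a previous run that did not exit) is likely holding the memory.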

If enabling memory growth with tf.config.experimental.set_memory_growth(gpu, True) does not help, limiting GPU memory usage manually may work:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    # Restrict TensorFlow to allocate at most 2 GB (1024 MB * 2) on the first GPU
    try:
        tf.config.experimental.set_virtual_device_configuration(
            gpus[0],
            [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024 * 2)])
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Virtual devices must be set before GPUs have been initialized
        print(e)

Credit goes to BryanBo-Cao for his comment.

I got the same error with Python 3.7.9, TensorFlow 2.1.0, CUDA 10.1.105, and cuDNN 7.6.5. It was solved by updating the GPU driver from NVIDIA.
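
Before reinstalling anything, it may help to compare the installed driver version (shown in the nvidia-smi header) against the minimum the CUDA toolkit needs. The sketch below does that comparison numerically. The 418.96 minimum for CUDA 10.1 on Windows is quoted from memory of the CUDA release notes and should be treated as an assumption to verify. The two driver versions passed in are sample inputs, and the function names are made up for illustration.

```python
def parse_version(text):
    """Turn a dotted version string such as '441.22' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_is_new_enough(installed, minimum):
    """Compare two dotted driver versions numerically, not as strings."""
    return parse_version(installed) >= parse_version(minimum)


# Assumed minimum for CUDA 10.1 on Windows; verify against the release notes.
MIN_DRIVER_CUDA_10_1_WINDOWS = "418.96"

print(driver_is_new_enough("441.22", MIN_DRIVER_CUDA_10_1_WINDOWS))  # True
print(driver_is_new_enough("417.35", MIN_DRIVER_CUDA_10_1_WINDOWS))  # False
```

Comparing tuples of integers avoids the string-comparison trap where "418.100" would sort before "418.96".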
