Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

In TensorFlow/Keras, when running the code from https://github.com/pierluigiferrari/ssd_keras (using the estimator ssd300_evaluation), I received this error:

Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

This is very similar to the unsolved question: Google Colab Error : Failed to get convolution algorithm. This is probably because cuDNN failed to initialize

My environment when the issue occurs:

Python: 3.6.4

TensorFlow version: 1.12.0

Keras version: 2.2.4

CUDA: V10.0

cuDNN: V7.4.1.5

GPU: NVIDIA GeForce GTX 1080

I also ran:

import tensorflow as tf

# Place a small matrix multiplication on the first GPU.
with tf.device('/gpu:0'):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
    c = tf.matmul(a, b)

# TF 1.x session API (the question uses TensorFlow 1.12).
with tf.Session() as sess:
    print(sess.run(c))

This ran with no errors or issues.

A minimal example is:

 from keras import backend as K
 from keras.models import load_model
 from keras.optimizers import Adam
 from scipy.misc import imread
 import numpy as np
 from matplotlib import pyplot as plt

 from models.keras_ssd300 import ssd_300
 from keras_loss_function.keras_ssd_loss import SSDLoss
 from keras_layers.keras_layer_AnchorBoxes import AnchorBoxes
 from keras_layers.keras_layer_DecodeDetections import DecodeDetections
 from keras_layers.keras_layer_DecodeDetectionsFast import DecodeDetectionsFast
 from keras_layers.keras_layer_L2Normalization import L2Normalization
 from data_generator.object_detection_2d_data_generator import DataGenerator
 from eval_utils.average_precision_evaluator import Evaluator
 import tensorflow as tf
 %matplotlib inline
 import keras
 keras.__version__



 # Set a few configuration parameters.
 img_height = 300
 img_width = 300
 n_classes = 20
 model_mode = 'inference'


 K.clear_session() # Clear previous models from memory.

 model = ssd_300(image_size=(img_height, img_width, 3),
            n_classes=n_classes,
            mode=model_mode,
            l2_regularization=0.0005,
            scales=[0.1, 0.2, 0.37, 0.54, 0.71, 0.88, 1.05], # The scales for MS COCO are [0.07, 0.15, 0.33, 0.51, 0.69, 0.87, 1.05]
            aspect_ratios_per_layer=[[1.0, 2.0, 0.5],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5, 3.0, 1.0/3.0],
                                     [1.0, 2.0, 0.5],
                                     [1.0, 2.0, 0.5]],
            two_boxes_for_ar1=True,
            steps=[8, 16, 32, 64, 100, 300],
            offsets=[0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
            clip_boxes=False,
            variances=[0.1, 0.1, 0.2, 0.2],
            normalize_coords=True,
            subtract_mean=[123, 117, 104],
            swap_channels=[2, 1, 0],
            confidence_thresh=0.01,
            iou_threshold=0.45,
            top_k=200,
            nms_max_output_size=400)

 # 2: Load the trained weights into the model.

 # TODO: Set the path of the trained weights.
 weights_path = 'C:/Users/USAgData/TF SSD Keras/weights/VGG_VOC0712Plus_SSD_300x300_iter_240000.h5'

 model.load_weights(weights_path, by_name=True)

 # 3: Compile the model so that Keras won't complain the next time you load it.

 adam = Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-08, decay=0.0)

 ssd_loss = SSDLoss(neg_pos_ratio=3, alpha=1.0)

 model.compile(optimizer=adam, loss=ssd_loss.compute_loss)


dataset = DataGenerator()

# TODO: Set the paths to the dataset here.
dir= "C:/Users/USAgData/TF SSD Keras/VOC/VOCtest_06-Nov-2007/VOCdevkit/VOC2007/"
Pascal_VOC_dataset_images_dir = dir+ 'JPEGImages'
Pascal_VOC_dataset_annotations_dir = dir + 'Annotations/'
Pascal_VOC_dataset_image_set_filename = dir+'ImageSets/Main/test.txt'

# The XML parser needs to know what object class names to look for and in which order to map them to integers.
classes = ['background',
           'aeroplane', 'bicycle', 'bird', 'boat',
           'bottle', 'bus', 'car', 'cat',
           'chair', 'cow', 'diningtable', 'dog',
           'horse', 'motorbike', 'person', 'pottedplant',
           'sheep', 'sofa', 'train', 'tvmonitor']

dataset.parse_xml(images_dirs=[Pascal_VOC_dataset_images_dir],
                  image_set_filenames=[Pascal_VOC_dataset_image_set_filename],
                  annotations_dirs=[Pascal_VOC_dataset_annotations_dir],
                  classes=classes,
                  include_classes='all',
                  exclude_truncated=False,
                  exclude_difficult=False,
                  ret=False)



evaluator = Evaluator(model=model,
                      n_classes=n_classes,
                      data_generator=dataset,
                      model_mode=model_mode)



results = evaluator(img_height=img_height,
                    img_width=img_width,
                    batch_size=8,
                    data_generator_mode='resize',
                    round_confidences=False,
                    matching_iou_threshold=0.5,
                    border_pixels='include',
                    sorting_algorithm='quicksort',
                    average_precision_mode='sample',
                    num_recall_points=11,
                    ignore_neutral_boxes=True,
                    return_precisions=True,
                    return_recalls=True,
                    return_average_precisions=True,
                    verbose=True)

I've seen this error message for three different reasons, with different solutions:

1. You have cache issues

I regularly work around this error by shutting down my Python process, removing the ~/.nv directory (on Linux, rm -rf ~/.nv), and restarting the Python process. I don't know exactly why this works. It's probably at least partly related to the second option:

2. You're out of memory

The error can also show up if you run out of graphics card RAM. With an NVIDIA GPU you can check graphics card memory usage with nvidia-smi. This will give you a readout of how much GPU RAM you have in use (something like 6025MiB / 6086MiB if you're almost at the limit) as well as a list of which processes are using GPU RAM.
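The same readout can be collected from Python. This is only a sketch: it assumes nvidia-smi is on the PATH and uses its CSV query mode, and the helper names parse_gpu_memory and query_gpu_memory are made up for illustration.

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, total' MiB pairs (one line per GPU) into a list of tuples."""
    readings = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        readings.append((used, total))
    return readings


def query_gpu_memory():
    """Ask nvidia-smi for used/total memory in MiB (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(out)


# Example using the readout quoted above (6025MiB / 6086MiB).
for used, total in parse_gpu_memory("6025, 6086\n"):
    print(f"{used} MiB of {total} MiB in use ({used / total:.0%})")
```

On a machine with a GPU, calling query_gpu_memory() before starting the notebook shows whether another process already holds most of the memory.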

If you've run out of RAM, you'll need to restart the process (which should free up the RAM) and then take a less memory-intensive approach. A few options are:

  • reducing your batch size
  • using a simpler model
  • using less data
  • limiting the TensorFlow GPU memory fraction: for example, the following will make sure TensorFlow uses <= 90% of your GPU RAM:
import keras
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.9  # 0.6 sometimes works better for folks
keras.backend.tensorflow_backend.set_session(tf.Session(config=config))

This can slow down your model evaluation if not used together with the items above, presumably since the large data set will have to be swapped in and out to fit into the small amount of memory you've allocated.

A second option is to have TensorFlow start out using only a minimum amount of memory and then allocate more as needed (documented here):

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

3. You have incompatible versions of CUDA, TensorFlow, NVIDIA drivers, etc.

If you've never had similar models working, you're not running out of VRAM, and your cache is clean, I'd go back and set up CUDA + TensorFlow using the best available installation guide. I have had the most success following the instructions at https://www.tensorflow.org/install/gpu rather than those on the NVIDIA / CUDA site. Lambda Stack is also a good way to go.

I had the same issue and solved it with this:

import os
os.environ['TF_FORCE_GPU_ALLOW_GROWTH'] = 'true'

or

physical_devices = tf.config.experimental.list_physical_devices('GPU')
if len(physical_devices) > 0:
   tf.config.experimental.set_memory_growth(physical_devices[0], True)
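The snippet above only touches the first GPU. A slightly more defensive variant, sketched here under the assumption of TensorFlow 2.1 or later (where tf.config.list_physical_devices is available without the experimental prefix), loops over every GPU and tolerates a machine with none. The function name enable_memory_growth is hypothetical, and it takes the tf.config module as an argument so that the logic can be read on its own.

```python
def enable_memory_growth(tf_config):
    """Turn on memory growth for every visible GPU.

    `tf_config` is expected to be the `tf.config` module (TensorFlow 2.x).
    Returns the number of GPUs for which the setting was applied.
    """
    configured = 0
    for gpu in tf_config.list_physical_devices('GPU'):
        try:
            tf_config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError as err:
            # Raised when the GPU has already been initialized.
            print(f"Could not set memory growth for {gpu}: {err}")
    return configured


# Usage, before building or loading any model:
#   import tensorflow as tf
#   enable_memory_growth(tf.config)
```

Memory growth has to be set before the GPU is first used, so this belongs at the very top of the script or notebook.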

I had this error and fixed it by uninstalling all CUDA and cuDNN versions from my system. Then I installed CUDA Toolkit 9.0 (without any patches) and cuDNN v7.4.1 for CUDA 9.0.

Keras is included in TensorFlow 2.0 and above. So:

  • remove import keras
  • replace each from keras.module.module import class statement with from tensorflow.keras.module.module import class
  • Maybe your GPU memory is full, so set allow_growth = True in the GPU options. This is deprecated now, but using the code snippet below after your imports may solve your problem.
import tensorflow as tf
from tensorflow.compat.v1.keras.backend import set_session
config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU
config.log_device_placement = True  # to log device placement (on which device the operation ran)
sess = tf.compat.v1.Session(config=config)
set_session(sess)

I also had the same issue with TensorFlow 2.4 and CUDA 11.0 with cuDNN v8.0.4. I wasted almost 2 to 3 days on it. The problem was just a driver mismatch. I had installed CUDA 11.0 Update 1, assuming the update would work well, but that was the culprit. I uninstalled CUDA 11.0 Update 1 and installed CUDA 11.0 without the update. Here is the list of drivers that worked for TensorFlow 2.4 on an RTX 2060 6GB GPU.

A list of the required hardware and software is given here

I also had to do this

import tensorflow as tf
physical_devices = tf.config.list_physical_devices('GPU') 
tf.config.experimental.set_memory_growth(physical_devices[0], True)

to avoid this error

2020-12-23 21:54:14.971709: I tensorflow/stream_executor/stream.cc:1404] [stream=000001E69C1DA210,impl=000001E6A9F88E20] did not wait for [stream=000001E69C1DA180,impl=000001E6A9F88730]
2020-12-23 21:54:15.211338: F tensorflow/core/common_runtime/gpu/gpu_util.cc:340] CPU->GPU Memcpy failed
[I 21:54:16.071 NotebookApp] KernelRestarter: restarting kernel (1/5), keep random ports
kernel 8b907ea5-33f1-4b2a-96cc-4a7a4c885d74 restarted
kernel 8b907ea5-33f1-4b2a-96cc-4a7a4c885d74 restarted

These are some samples of the errors I was getting:

Type 1

UnpicklingError: invalid load key, 'H'.

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
<ipython-input-2-f049ceaad66a> in <module>

Type 2


InternalError: Blas GEMM launch failed : a.shape=(15, 768), b.shape=(768, 768), m=15, n=768, k=768 [Op:MatMul]

During handling of the above exception, another exception occurred:

Type 3

failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.534375: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.534683: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.534923: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.539327: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.539523: E tensorflow/stream_executor/cuda/cuda_dnn.cc:336] Could not create cudnn handle: CUDNN_STATUS_ALLOC_FAILED
2020-12-23 21:31:04.539665: W tensorflow/core/framework/op_kernel.cc:1763] OP_REQUIRES failed at conv_ops_fused_impl.h:697 : Unknown: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.

The problem is an incompatibility between newer versions of TensorFlow (1.10.x and above) and cuDNN 7.0.5 with CUDA 9.0. The easiest fix is to downgrade TensorFlow to 1.8.0:

pip install --upgrade tensorflow-gpu==1.8.0

This is a follow-up to point 2 of https://stackoverflow.com/a/56511889/2037998.

2. You're out of memory

I used the following code to limit the GPU RAM usage:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 4 GB (1024 * 4 MiB) of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=(1024*4))])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

This code sample comes from: TensorFlow: Use a GPU: Limiting GPU memory growth. Put this code before any other TF/Keras code you are using.

Note: The application might still use a bit more GPU RAM than the number above.

Note 2: If the system also runs other applications (like a UI), these programs can also consume some GPU RAM (Xorg, Firefox, ... sometimes up to 1GB of GPU RAM combined).
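One way to pick the memory_limit value with the two notes in mind is to subtract a reservation for other programs and keep a little headroom. The helper below is a hypothetical illustration; the default reservation (1024 MiB) and headroom (0.9) are assumptions chosen to mirror the numbers mentioned in this thread, not recommended values.

```python
def memory_limit_mib(total_mib, reserved_mib=1024, headroom=0.9):
    """Return a memory_limit (in MiB) to pass to VirtualDeviceConfiguration.

    total_mib    -- total GPU memory, e.g. as reported by nvidia-smi
    reserved_mib -- memory left for other programs (Xorg, browser, ...)
    headroom     -- fraction of the remainder handed to TensorFlow
    """
    if total_mib <= reserved_mib:
        raise ValueError("GPU has no memory left after the reservation")
    return int((total_mib - reserved_mib) * headroom)


# Example: an 8 GiB card with 1 GiB kept free for the desktop.
print(memory_limit_mib(8192))
```

The result would replace the hard-coded 1024*4 in the snippet above.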

I got the same error. The reason is a mismatch between your CUDA/cuDNN version and your TensorFlow version. There are two ways to solve this:

  1. Either downgrade your TensorFlow version: pip install --upgrade tensorflow-gpu==1.8.0

  2. Or follow the steps here.

    Tip: choose your Ubuntu version and follow the steps. :-)

I had this same issue with an RTX 2080. The following code worked for me.

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

I was having the same issue, but adding these lines of code at the start solved my problem:

physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

This works with TensorFlow 2.

I had this problem after upgrading to TF 2.0. The following started giving the error:

   outputs = tf.nn.conv2d(images, filters, strides=1, padding="SAME")

I am using Ubuntu 16.04.6 LTS (Azure Data Science VM) and TensorFlow 2.0. I upgraded following the instructions on this TensorFlow GPU instructions page. This resolved the issue for me. By the way, it's a bunch of apt-get updates/installs, and I executed all of them.

Just add

from tensorflow.compat.v1 import ConfigProto
from tensorflow.compat.v1 import InteractiveSession

config = ConfigProto()
config.gpu_options.allow_growth = True
session = InteractiveSession(config=config)

I had the same problem. I am using a conda environment, so my packages are managed automatically by conda. I solved the problem by constraining the memory allocation of TensorFlow v2 (Python 3.x):

physical_devices = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_memory_growth(physical_devices[0], True)

This solved my problem. However, it limits the memory very much. When I simultaneously ran

nvidia-smi

I saw that it was about 700 MB. To see more options, one can inspect the code on TensorFlow's website:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
  # Restrict TensorFlow to only allocate 1GB of memory on the first GPU
  try:
    tf.config.experimental.set_virtual_device_configuration(
        gpus[0],
        [tf.config.experimental.VirtualDeviceConfiguration(memory_limit=1024)])
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Virtual devices must be set before GPUs have been initialized
    print(e)

In my case the code snippet above solved the problem perfectly.

Note: I didn't try installing TensorFlow with pip; this worked with TensorFlow installed through conda.

Ubuntu: 18.04

Python: 3.8.5

tensorflow: 2.2.0

cudnn: 7.6.5

cudatoolkit: 10.1.243

As already observed by Anurag Bhalekar above, this can be fixed with a dirty workaround: set up and run a model in your code before loading an old model with load_model() from Keras. This seems to correctly initialize cuDNN, which can then be used for load_model().

In my case, I am using the Spyder IDE to run all my Python scripts. Specifically, I set up, train, and save a CNN in one script. After that, another script loads the saved model for visualization. If I open Spyder and directly run the visualization script to load an old, saved model, I get the same error as mentioned above. I was still able to load the model and modify it, but when I tried to create a prediction, I got the error.

However, if I first run my training script in a Spyder instance and then run the visualization script in the same Spyder instance, it works fine without any errors:

#training a model correctly initializes cuDNN
model=Sequential()
model.add(Conv2D(32,...))
model.add(Dense(num_classes,...))
model.compile(...)
model.fit() #this all works fine

Afterwards, the following code, including load_model(), works fine:

#this script relies on cuDNN already being initialized by the script above
from keras.models import load_model
model = load_model(modelPath) #works
model = Model(inputs=model.inputs, outputs=model.layers[1].output) #works
feature_maps = model.predict(img) #produces the error only if the first piece of code is not run

I could not figure out why this is, or how to solve the problem in a different way, but for me, training a small working Keras model before using load_model() is a quick and dirty fix that does not require reinstalling cuDNN or anything else.

I was facing the same issue. I think the GPU is not able to load all the data at once. I resolved it by reducing the batch size.
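Finding a batch size that fits can be automated by halving it after each failure. The sketch below is framework-agnostic and entirely hypothetical: run_with_smaller_batches and fake_step are illustrative names, and the stand-in step simply raises MemoryError where a real TensorFlow call would raise its own error.

```python
def run_with_smaller_batches(step, batch_size, min_batch_size=1):
    """Call step(batch_size), halving batch_size after each failure.

    `step` is any callable that raises an exception when the batch does not
    fit. Returns (batch_size_that_worked, result_of_step).
    """
    while True:
        try:
            return batch_size, step(batch_size)
        except Exception as err:  # narrow this to the error you actually see
            if batch_size <= min_batch_size:
                raise
            print(f"batch_size={batch_size} failed ({err}); halving")
            batch_size = max(min_batch_size, batch_size // 2)


# Stand-in for an evaluation call that only fits 8 samples at a time.
def fake_step(batch_size):
    if batch_size > 8:
        raise MemoryError("out of GPU memory")
    return f"ok with {batch_size}"


print(run_with_smaller_batches(fake_step, batch_size=32))
```

Since other answers here report that a process which ran out of GPU memory needs a restart, this is better treated as a way to search for a workable size than as a recovery mechanism inside one session.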

I was struggling with this problem for a week. The reason was very silly: I used high-res photos for training.

Hopefully, this will save someone's time :)

The problem can also occur if there is an incompatible version of cuDNN, which could be the case if you installed TensorFlow with conda, as conda also installs CUDA and cuDNN while installing TensorFlow.

The solution is to install TensorFlow with pip, and install CUDA and cuDNN separately without conda. For example, if you have CUDA 10.0.130 and cuDNN 7.4.1 (tested configurations), then

pip install tensorflow-gpu==1.13.1

1) Close all other notebooks that use the GPU.

2) TF 2.0 needs the cuDNN SDK (>= 7.4.1).

Extract it and add the path to the 'bin' folder to "environment variables / system variables / path": "D:\\Programs\\x64\\Nvidia\\cudnn\\bin"

In my case, this error occurred when I loaded the model directly from .json and .h5 files and attempted to predict output on certain inputs. So, before doing anything like this, I tried training an example model on MNIST. That allowed cuDNN to initialize.

I had the same problem, but with a simpler solution than the others posted here. I have both CUDA 10.0 and 10.2 installed, but I only had cuDNN for 10.2, and this version [at the time of this post] is not compatible with TensorFlow GPU. I just installed cuDNN for CUDA 10.0 and now everything runs fine!

Workaround: I did a fresh install of TF 2.0 and ran a simple MNIST tutorial, which was fine. I then opened another notebook, tried to run it, and encountered this issue. I exited all notebooks, restarted Jupyter, opened only one notebook, and ran it successfully. The issue seems to be either memory or running more than one notebook on the GPU.
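To see which notebooks or other processes are holding GPU memory, nvidia-smi can list compute processes in CSV form. This is a sketch that assumes nvidia-smi is on the PATH; the helper names are hypothetical, and the sample line in the example is hand-written rather than real output.

```python
import subprocess


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' lines into a list of dicts."""
    apps = []
    for line in csv_text.strip().splitlines():
        pid, rest = line.split(",", 1)
        name, used = rest.rsplit(",", 1)
        used = used.strip()
        apps.append({
            "pid": int(pid),
            "name": name.strip(),
            # Some platforms report a non-numeric placeholder here.
            "used_mib": int(used) if used.isdigit() else None,
        })
    return apps


def list_gpu_processes():
    """Query nvidia-smi for the processes currently using the GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_compute_apps(out)


# Example with a hand-written line in the expected format.
print(parse_compute_apps("4242, /usr/bin/python3, 5800\n"))
```

Any Jupyter kernel that shows up in this list still owns its GPU memory until the kernel is shut down.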

Thanks

I had the same problem, and my config is TensorFlow 1.13.1, CUDA 10.0, cuDNN 7.6.4. I changed cuDNN's version to 7.4.2 and, luckily, that solved the problem.

Enabling memory growth on the GPU at the start of my code solved the problem:

import tensorflow as tf

physical_devices = tf.config.experimental.list_physical_devices('GPU')
print("Num GPUs Available: ", len(physical_devices))
tf.config.experimental.set_memory_growth(physical_devices[0], True)

Num GPUs Available: 1

Reference: https://deeplizard.com/learn/video/OO4HD-1wRN8

At the start of your notebook or code, add the lines of code below:

import tensorflow as tf

physical_devices = tf.config.experimental.list_physical_devices('GPU')

tf.config.experimental.set_memory_growth(physical_devices[0], True)

I had a similar problem. TensorFlow complained that it expected a certain version of cuDNN but found a different one. So, I downloaded the version it expected from https://developer.nvidia.com/rdp/cudnn-archive and installed it. It now works.
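Recent TensorFlow releases can report which CUDA and cuDNN versions the binary was built against, which tells you what to download. The sketch below assumes tf.sysconfig.get_build_info() (available, as far as I know, from TensorFlow 2.3) and the key names is_cuda_build, cuda_version, and cudnn_version; treat those key names as an assumption to verify on your install. describe_build is a hypothetical helper.

```python
def describe_build(build_info):
    """Summarize the CUDA/cuDNN versions a TensorFlow binary was built for."""
    if not build_info.get("is_cuda_build", False):
        return "This TensorFlow build has no CUDA support."
    cuda = build_info.get("cuda_version", "unknown")
    cudnn = build_info.get("cudnn_version", "unknown")
    return f"Built for CUDA {cuda} and cuDNN {cudnn}."


# With TensorFlow installed:
#   import tensorflow as tf
#   print(describe_build(tf.sysconfig.get_build_info()))

# Example with a hand-written dictionary instead of a real build.
print(describe_build(
    {"is_cuda_build": True, "cuda_version": "10.1", "cudnn_version": "7"}
))
```

Comparing this against the CUDA and cuDNN versions actually installed on the system shows at a glance whether a mismatch is the likely cause.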

If you installed tensorflow-gpu using conda, then uninstall the cudnn and cudatoolkit packages that were installed along with it and re-run the notebook.

NOTE: Trying to uninstall only these two packages in conda would force a chain of other packages to be uninstalled as well. So, use the following commands to uninstall only these packages:

(1) To remove CUDA

conda remove --force cudatoolkit

(2) To remove cuDNN

conda remove --force cudnn

Now run TensorFlow; it should work!

Without any rep I can't add this as a comment to the two existing answers above from Anurag and Obnebion, nor can I upvote them, so I'm posting a new answer even though it seems to break the guidelines. Anyway, I originally had the problem that the other answers on this page address, and fixed it, but then re-encountered the same message later on when I started to use checkpoint callbacks. At this point, only the Anurag/Obnebion answer was relevant. It turns out I'd originally been saving the model as a .json and the weights separately as .h5, then using model_from_json along with a separate model.load_weights to get the weights back again. That worked (I have CUDA 10.2 and TensorFlow 2.x). It was only when I tried to switch to the all-in-one save/load_model from the checkpoint callback that it broke. This is the small change I made to keras.callbacks.ModelCheckpoint in the _save_model method:

                            if self.save_weights_only:
                                self.model.save_weights(filepath, overwrite=True)
                            else:
                                model_json = self.model.to_json()
                                with open(filepath+'.json','w') as fb:
                                    fb.write(model_json)
                                    fb.close()
                                self.model.save_weights(filepath+'.h5', overwrite=True)
                                with open(filepath+'-hist.pickle','wb') as fb:
                                    trainhistory = {"history": self.model.history.history,"params": self.model.history.params}
                                    pickle.dump(trainhistory,fb)
                                    fb.close()
                                # self.model.save(filepath, overwrite=True)

The history pickle dump is just a kludge for yet another question on Stack Overflow: what happens to the history object when you exit early from a checkpoint callback. You can see in the _save_model method that there is a line which pulls the loss monitor array out of the logs dict... but never writes it to a file! So I just put in the kludge accordingly. Most people don't recommend using pickles like this. My code is just a hack, so it doesn't matter.

It seems like the libraries need some warm-up. This isn't an effective solution for production, but you can at least carry on with other bugs...

from keras.models import Sequential
import numpy as np
from keras.layers import Dense
from keras.datasets import mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
model = Sequential()
model.add(Dense(1000,input_dim=(784),activation='relu') )  #input layer
model.add(Dense(222,activation='relu'))                     #hidden layer
model.add(Dense(100,activation='relu'))   
model.add(Dense(50,activation='relu'))   
model.add(Dense(10,activation='sigmoid'))   
model.compile(optimizer="adam",loss='categorical_crossentropy',metrics=["accuracy"])
x_train = np.reshape(x_train,(60000,784))/255
x_test = np.reshape(x_test,(10000,784))/255
from keras.utils import np_utils
y_train = np_utils.to_categorical(y_train) 
y_test = np_utils.to_categorical(y_test)
model.fit(x_train[:1000],y_train[:1000],epochs=1,batch_size=32)

If you are Chinese, please make sure your working path does not contain Chinese characters, and make your batch_size smaller and smaller. Thanks!

Just install TensorFlow with GPU support using this command: pip install tensorflow. You don't need to install the GPU package separately. If you install the GPU package separately, there is a high chance the versions will be mismatched.

But for releases 1.15 and older, the CPU and GPU packages are separate.

I struggled with this for a while working on an AWS Ubuntu instance.

Then I found the solution, which was quite simple in this case.

Do not install tensorflow-gpu with pip (pip install tensorflow-gpu), but with conda (conda install tensorflow-gpu), so that it is in the conda environment and installs cudatoolkit and cudnn in the right environment.

That worked for me, saved my day, and I hope it helps somebody else.

See the original solution from learnermaxRL here: https://github.com/tensorflow/tensorflow/issues/24828#issuecomment-453727142
