Blas GEMM launch failed :

Question

I am trying to build a dense classifier on top of a pre-trained CNN model. A working GPU is configured, and TensorFlow is using it for operations. My environment was not created with Anaconda; the setup is: IDE - PyCharm, TF = 2.4.0, CUDA = 11.0

But I am unable to get any output because of

Blas GEMM launch failed: a.shape=(20, 8192), b.shape=(8192, 256), m=20, n=256, k=8192

It also shows that

failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED

I searched online and tried setting

configuration.gpu_options.allow_growth = True

But it did not help. I also tried the following code:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.experimental.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

But the error remains the same. I also tried other solutions from different Stack Overflow answers, and none of them worked.

My code looks like this

import os
import numpy as np
import shutil
import matplotlib.pyplot as plt

import tensorflow as tf
print(tf.test.gpu_device_name())

print(tf.__version__)


from tensorflow.keras import layers
from tensorflow.keras import models
from tensorflow.keras import optimizers
from tensorflow.keras.preprocessing.image import ImageDataGenerator



from tensorflow.keras.applications import VGG16
conv_base = VGG16(weights = 'imagenet', include_top = False, input_shape = (150,150,3))

base_dir = 'C:/Users/emamu/Downloads/cat_and_dog_small'
train_dir = os.path.join(base_dir, 'train')
val_dir = os.path.join(base_dir, 'val')
test_dir = os.path.join(base_dir, 'test')

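# rescale must be passed by keyword; a positional 1./255 is read as
# featurewise_center, which triggers the UserWarning seen in the log
data_generator_unaugmented = ImageDataGenerator(rescale=1./255)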
batch_size = 20

def extract_feature(directory, sample_count):
    features = np.zeros(shape=(sample_count, 4, 4, 512))
    labels = np.zeros(shape=(sample_count))

    generator = data_generator_unaugmented.flow_from_directory(directory,
                                                               target_size = (150, 150),
                                                               batch_size = batch_size,
                                                               class_mode = 'binary')

    i = 0
    for input_batch, label_batch in generator:
        feature_batch = conv_base.predict(input_batch)
        features[i*batch_size:(i+1)*batch_size] = feature_batch
        labels[i*batch_size:(i+1)*batch_size] = label_batch
        i=i+1
        if i*batch_size >= sample_count:
            break
    return features, labels

train_feature, train_label = extract_feature(train_dir, 2000)
val_feature, val_label = extract_feature(val_dir, 1000)
test_feature, test_label = extract_feature(test_dir, 1000)

train_feature = np.reshape(train_feature, (2000, 4*4*512))
val_feature = np.reshape(val_feature, (1000, 4*4*512))
test_feature = np.reshape(test_feature, (1000, 4*4*512))


# Creating a classifier model on top

model = models.Sequential()
model.add(layers.Dense(256, activation='relu', input_dim=4*4*512))
model.add(layers.Dropout(0.4))
model.add(layers.Dense(1, activation='sigmoid'))

model.compile(optimizer=optimizers.RMSprop(lr=1e-5), loss='binary_crossentropy', metrics=['acc'])

history = model.fit(train_feature, train_label,
                    epochs = 30, batch_size = 20,
                    validation_data=(val_feature, val_label))

acc = history.history['acc']
val_acc = history.history['val_acc']  # key follows metrics=['acc'], so it is 'val_acc'
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(acc)+1)
plt.plot(epochs, acc, label = 'training accuracy')
plt.plot(epochs, val_acc, label = 'val accuracy')
plt.legend()

plt.figure()  # start a second figure for the loss curves
plt.plot(epochs, loss, label = 'training loss')
plt.plot(epochs, val_loss, label = 'val loss')
plt.legend()
plt.show()

Whatever I try, I am getting the following error:

C:\Users\emamu\PycharmProjects\gput_test_01\venv\Scripts\python.exe C:/Users/emamu/PycharmProjects/gput_test_01/cnn_transfer_learning.py
2021-05-06 12:00:03.592087: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-05-06 12:00:05.532458: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-05-06 12:00:05.534984: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library nvcuda.dll
2021-05-06 12:00:05.559708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 Ti computeCapability: 7.5
coreClock: 1.485GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 178.84GiB/s
2021-05-06 12:00:05.559871: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-05-06 12:00:05.568725: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-05-06 12:00:05.568814: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-05-06 12:00:05.572032: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-05-06 12:00:05.573585: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-05-06 12:00:05.580959: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-05-06 12:00:05.583500: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-05-06 12:00:05.584151: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-05-06 12:00:05.584333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
/device:GPU:0
2.4.0
2021-05-06 12:00:06.133316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-06 12:00:06.133441: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-05-06 12:00:06.133513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-05-06 12:00:06.133740: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/device:GPU:0 with 2903 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-05-06 12:00:06.134445: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-06 12:00:06.145654: I tensorflow/compiler/jit/xla_cpu_device.cc:41] Not creating XLA devices, tf_xla_enable_xla_devices not set
2021-05-06 12:00:06.146140: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 Ti computeCapability: 7.5
coreClock: 1.485GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 178.84GiB/s
2021-05-06 12:00:06.146343: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-05-06 12:00:06.146433: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-05-06 12:00:06.146522: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-05-06 12:00:06.146610: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-05-06 12:00:06.146699: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-05-06 12:00:06.146787: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-05-06 12:00:06.146880: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-05-06 12:00:06.146972: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-05-06 12:00:06.147095: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-05-06 12:00:06.147534: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties: 
pciBusID: 0000:01:00.0 name: GeForce GTX 1650 Ti computeCapability: 7.5
coreClock: 1.485GHz coreCount: 16 deviceMemorySize: 4.00GiB deviceMemoryBandwidth: 178.84GiB/s
2021-05-06 12:00:06.147732: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudart64_110.dll
2021-05-06 12:00:06.147840: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-05-06 12:00:06.147933: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublasLt64_11.dll
2021-05-06 12:00:06.148021: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cufft64_10.dll
2021-05-06 12:00:06.148106: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library curand64_10.dll
2021-05-06 12:00:06.148193: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusolver64_10.dll
2021-05-06 12:00:06.148276: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cusparse64_11.dll
2021-05-06 12:00:06.148359: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-05-06 12:00:06.148464: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1862] Adding visible gpu devices: 0
2021-05-06 12:00:06.148562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1261] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-05-06 12:00:06.148646: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1267]      0 
2021-05-06 12:00:06.148699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1280] 0:   N 
2021-05-06 12:00:06.148810: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1406] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2903 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1650 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
2021-05-06 12:00:06.148979: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
Found 2000 images belonging to 2 classes.
C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\keras_preprocessing\image\image_data_generator.py:720: UserWarning: This ImageDataGenerator specifies `featurewise_center`, but it hasn't been fit on any training data. Fit it first by calling `.fit(numpy_data)`.
  warnings.warn('This ImageDataGenerator specifies '
2021-05-06 12:00:06.750321: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-05-06 12:00:06.876918: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cudnn64_8.dll
2021-05-06 12:00:08.024040: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0

2021-05-06 12:00:08.087066: I tensorflow/core/platform/windows/subprocess.cc:308] SubProcess ended with return code: 0

2021-05-06 12:00:08.222607: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library cublas64_11.dll
2021-05-06 12:00:08.720199: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:09.177801: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.66GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-05-06 12:00:09.178227: W tensorflow/core/common_runtime/bfc_allocator.cc:248] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.66GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
2021-05-06 12:00:09.257660: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:09.587331: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:10.196301: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:10.486611: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:11.055390: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:11.328883: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:11.828937: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:12.031743: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
Found 1000 images belonging to 2 classes.
Found 1000 images belonging to 2 classes.
Epoch 1/30
2021-05-06 12:00:45.989644: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:45.990645: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:45.991702: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:45.992462: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:45.993468: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:46.000724: E tensorflow/stream_executor/cuda/cuda_blas.cc:226] failed to create cublas handle: CUBLAS_STATUS_ALLOC_FAILED
2021-05-06 12:00:46.000869: W tensorflow/stream_executor/stream.cc:1455] attempting to perform BLAS operation using StreamExecutor without BLAS support
Traceback (most recent call last):
  File "C:/Users/emamu/PycharmProjects/gput_test_01/cnn_transfer_learning.py", line 92, in <module>
    history = model.fit(train_feature, train_label,
  File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\keras\engine\training.py", line 1100, in fit
    tmp_logs = self.train_function(iterator)
  File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 828, in __call__
    result = self._call(*args, **kwds)
  File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\def_function.py", line 888, in _call
    return self._stateless_fn(*args, **kwds)
  File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\function.py", line 2942, in __call__
    return graph_function._call_flat(
  File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\function.py", line 1918, in _call_flat
    return self._build_call_outputs(self._inference_function.call(
  File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\function.py", line 555, in call
    outputs = execute.execute(
  File "C:\Users\emamu\PycharmProjects\gput_test_01\venv\lib\site-packages\tensorflow\python\eager\execute.py", line 59, in quick_execute
    tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError:  Blas GEMM launch failed : a.shape=(20, 8192), b.shape=(8192, 256), m=20, n=256, k=8192
     [[node sequential/dense/MatMul (defined at /Users/emamu/PycharmProjects/gput_test_01/cnn_transfer_learning.py:92) ]] [Op:__inference_train_function_11890]

Function call stack:
train_function


Process finished with exit code 1

What is the actual reason for this error? Is there any permanent workaround for this issue?

Answer 1

I have solved the issue. The installation process is the same as the one described on the TensorFlow website. However, if an older version of CUDA or cuDNN is already installed, first remove all the CUDA drivers from the PC and install Visual Studio. Then comes the most important part: choosing the correct versions of CUDA and cuDNN. If the TF version is 2.4.x, choose CUDA 11.0.x and cuDNN 8.0.5 (or any other version that supports only CUDA 11.0.x).
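A quick way to see whether the installed pieces line up is to compare what TensorFlow was built against with the pairing it was tested with. The snippet below is only a rough sketch: `TESTED_GPU_BUILDS` and `expected_cuda_cudnn` are hypothetical names, and the table entries are an assumption copied from TensorFlow's published tested build configurations, so re-check them there before relying on them. `tf.sysconfig.get_build_info()` is available from TF 2.3 onwards.

```python
# Hypothetical lookup table (assumption: values taken from TensorFlow's
# "tested build configurations" list; verify them against the official page).
TESTED_GPU_BUILDS = {
    "2.3": ("10.1", "7.6"),
    "2.4": ("11.0", "8.0"),
    "2.5": ("11.2", "8.1"),
}


def expected_cuda_cudnn(tf_version):
    """Return the (CUDA, cuDNN) pair listed for a version such as '2.4.0'.

    Returns None when the major.minor version is not in the table.
    """
    major_minor = ".".join(tf_version.split(".")[:2])
    return TESTED_GPU_BUILDS.get(major_minor)


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        # Build-time versions; the keys are absent on CPU-only builds.
        info = tf.sysconfig.get_build_info()
        print("TensorFlow:", tf.__version__)
        print("built against CUDA:", info.get("cuda_version"))
        print("built against cuDNN:", info.get("cudnn_version"))
        print("tested pairing:", expected_cuda_cudnn(tf.__version__))
```

If the CUDA or cuDNN version actually installed on the machine differs from the reported pairing, that mismatch is the first thing I would fix. The exact format of the reported strings may vary by platform, so compare them by eye rather than with a strict equality check.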

For more details about the versions, check this page. Again, matching the versions of TensorFlow, CUDA, and cuDNN is the most crucial part of the tensorflow-gpu installation process.
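As a sanity check that the matrix sizes themselves are not the problem, here is some back-of-the-envelope arithmetic using the shapes from the error message and the "2903 MB" reported for the device in the log. It assumes float32 values (the Keras default); `matmul_bytes` is just an illustrative helper name.

```python
BYTES_PER_FLOAT32 = 4  # assumption: tensors use Keras' default float32 dtype


def matmul_bytes(m, k, n, itemsize=BYTES_PER_FLOAT32):
    """Bytes held by A (m x k), B (k x n) and the result C (m x n)."""
    return (m * k + k * n + m * n) * itemsize


# Shapes from the error: a.shape=(20, 8192), b.shape=(8192, 256)
needed = matmul_bytes(m=20, k=8192, n=256)

# "2903 MB" from the "Created TensorFlow device" line in the log
available = 2903 * 1024 * 1024

print(f"operands + result: {needed / 2**20:.2f} MiB")
print(f"memory TensorFlow reserved on the GPU: {available / 2**20:.0f} MiB")
print(f"fraction used by this matmul: {needed / available:.4%}")
```

The multiplication needs roughly 8.6 MiB, a tiny fraction of what the GPU offers. That fits the log, where the failure is reported while creating the cuBLAS handle (CUBLAS_STATUS_ALLOC_FAILED) rather than while allocating the tensors, and it is in line with a setup problem instead of a model that is too large for the card.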
