
A100 tensorflow gpu error: "Failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error"

I am trying to run TensorFlow with GPU support in a Docker container on a virtual machine. I have tried lots of solutions found online, but none of them work for me. Here are some of the steps I took:

I verified that the driver, the CUDA toolkit, and cuDNN are installed inside the container using nvidia-smi and nvcc -V:

[screenshots of the nvidia-smi and nvcc -V output]
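For reference, the in-container checks above can be scripted; this is a sketch that skips gracefully when a tool is missing:

```shell
#!/bin/sh
# Verify the driver and CUDA toolkit inside the container (sketch only).
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi          # driver version and GPUs visible to the container
else
  echo "nvidia-smi not found"
fi
if command -v nvcc >/dev/null 2>&1; then
  nvcc -V             # CUDA compiler / toolkit version
else
  echo "nvcc not found"
fi
```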

The Python version is 3.8.10, and the TensorFlow version is:

import tensorflow as tf 
tf.__version__
'2.6.0'

The error appears when calling tf.config.list_physical_devices():

[screenshot of the error]
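A quick way to reproduce this check from Python is sketched below; it degrades gracefully when TensorFlow is not installed (the `check_gpus` helper name is my own):

```python
import importlib.util

def check_gpus():
    """Return the physical GPUs TensorFlow can see, or a note if TF is absent."""
    # Guard so the snippet also runs on machines without TensorFlow installed.
    if importlib.util.find_spec("tensorflow") is None:
        return "tensorflow not installed"
    import tensorflow as tf
    # On the broken setup, this is where cuInit fails and the GPU list comes back empty.
    return tf.config.list_physical_devices("GPU")
```

An empty list here, together with the cuInit error on stderr, confirms the symptom described above.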

So the GPU is somehow not visible to TensorFlow. All TensorFlow builds return the same error:

 E tensorflow/stream_executor/cuda/cuda_driver.cc:271] failed call to cuInit: CUDA_ERROR_NOT_INITIALIZED: initialization error

but with some versions, for example 1.14, there is an additional message about the CPU:

Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA

[screenshot of the TensorFlow 1.14 log]

The GPU is an A100 and the CPU is an Intel(R) Xeon(R) Gold 6226R.

What is going on here? How do I fix this?

I realized that the GPU has the Multi-Instance GPU (MIG) feature:

[screenshot]

Therefore, GPU instances have to be configured:

sudo nvidia-smi mig -cgi 0 -C 

[screenshot]

and afterwards, calling nvidia-smi shows:

[screenshot of nvidia-smi listing the MIG instance]

And the problem is solved!
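The MIG setup above can be sketched as a short shell session; it assumes GPU index 0, requires root, and only does real work on a host where nvidia-smi is available:

```shell
#!/bin/sh
# Sketch of the MIG setup steps above (assumption: GPU index 0; requires root on the A100 host).
if command -v nvidia-smi >/dev/null 2>&1; then
  # Enable MIG mode on GPU 0 (may require stopping GPU clients first).
  sudo nvidia-smi -i 0 -mig 1
  # Create a GPU instance from profile 0 and a matching compute instance (-C).
  sudo nvidia-smi mig -cgi 0 -C
  # List the resulting MIG devices that CUDA applications will see.
  nvidia-smi -L
else
  echo "nvidia-smi not found; run this on the A100 host"
fi
```

Without a configured GPU or compute instance, MIG-capable GPUs expose no usable CUDA device, which is why cuInit failed inside the container.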
