Tensorflow GPU / CUDA installation on Ubuntu

I have set up Ubuntu 18.04 and tried to make Tensorflow 2.2 with GPU support work from Python (I have an Nvidia/CUDA graphics card). Even after following the documentation at https://www.tensorflow.org/install/gpu#linux_setup , it failed (see below for details about how it failed).

Question: do you have a canonical "todo" list (starting point: a freshly installed Ubuntu server) for installing tensorflow-gpu and making it work in a few steps?

Notes:

  • I have read many similar forum posts, and I think a canonical "todo" (from a fresh Ubuntu install to a working tensorflow-gpu), with a few steps/bash commands, would be useful.

  • The documentation I used involved

     export LD_LIBRARY_PATH...
     # Add NVIDIA package repository
     sudo apt-key adv --fetch-keys http://developer.download...
     ...
     # Install CUDA and tools. Include optional NCCL 2.x
     sudo apt install cuda9.0 cuda...

    Even after a lot of trial and error (I am not pasting all the different errors here, it would be too long), in the end

     import tensorflow

    always failed. One of the errors was `ImportError: libcublas.so.9.0: cannot open shared object file: No such file or directory`. I have already read the relevant question here, and this very long (!) Github issue.

  • After some more trial and error, import tensorflow works, but it doesn't use the GPU (see also Tensorflow not running on GPU).

Well, I was facing the same problem. The first thing to do is to look up which CUDA version your Tensorflow version requires. In your case, Tensorflow 2.2 requires CUDA 10.1. The correct cuDNN version is also important; in your case it would be cuDNN 7.6. An additional point is the installed Python version; I would recommend Python 3.5-3.8. If any one of those mismatches, full compatibility is almost impossible.
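A minimal sketch of that check in Python, using only the CUDA and Python versions named above. The function names are made up for illustration, and the cuDNN check is left out; the official compatibility table remains the authority.

```python
import sys

# Requirements quoted in the answer for TensorFlow 2.2.
REQUIRED_CUDA = "10.1"
PYTHON_MIN, PYTHON_MAX = (3, 5), (3, 8)


def python_ok(version=None):
    """True if the (major, minor) Python version lies in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return PYTHON_MIN <= major_minor <= PYTHON_MAX


def cuda_ok(installed):
    """True if a version string such as '10.1.243' matches the required major.minor."""
    return installed.split(".")[:2] == REQUIRED_CUDA.split(".")
```

Calling `python_ok()` without arguments checks the interpreter that is currently running, so it is best run inside the environment where TensorFlow will be installed.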

So if you want a check list, here you go:

  1. Install CUDA 10.1, for example by installing nvidia-cuda-toolkit (check that the packaged version really is 10.1).
  2. Install the cuDNN version compatible with CUDA 10.1.
  3. Export the CUDA environment variables.
  4. If Bazel is not installed, you will be asked about it.
  5. Install TensorFlow 2.2 using pip. I would highly recommend using a virtual environment.
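Once steps 1 to 3 are done, one hedged way to confirm that the dynamic loader can actually find the CUDA libraries is to try opening them with ctypes. The library names below are assumptions for a CUDA 10.1 / cuDNN 7 stack (they are not taken from this answer); replace them with whatever name appears in your ImportError.

```python
import ctypes

# Assumed sonames for a CUDA 10.1 / cuDNN 7 installation; adjust to your versions.
EXPECTED_LIBS = ["libcudart.so.10.1", "libcublas.so.10", "libcudnn.so.7"]


def find_missing_libs(names, loader=ctypes.CDLL):
    """Return the library names that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            loader(name)
        except OSError:
            # Same failure mode as "cannot open shared object file".
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing_libs(EXPECTED_LIBS)
    if missing:
        print("Cannot load: " + ", ".join(missing))
    else:
        print("All expected CUDA libraries can be loaded")
```

If a library is reported missing even though it is installed, the directory that contains it is probably not on LD_LIBRARY_PATH for the shell that starts Python.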

You can find the compatibility check list of Tensorflow and CUDA here.

You can find the CUDA Toolkit here.

Finally, get cuDNN in the correct version here.

That's all.
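For step 5 of the check list, a rough sketch of creating the virtual environment and installing into it from Python's standard library. The helper names and the pinned package string `tensorflow-gpu==2.2.0` are illustrative assumptions, and on Ubuntu the `venv` module with pip may first require the python3-venv package.

```python
import os
import subprocess
import venv


def pip_install_command(env_dir, package="tensorflow-gpu==2.2.0"):
    """Command that installs `package` with the environment's own interpreter (Linux layout)."""
    python = os.path.join(env_dir, "bin", "python")
    return [python, "-m", "pip", "install", package]


def create_env_and_install(env_dir):
    """Create the virtual environment (with pip) and install TensorFlow into it."""
    venv.create(env_dir, with_pip=True)
    subprocess.run(pip_install_command(env_dir), check=True)
```

Calling `create_env_and_install` with a directory of your choice performs both steps; nothing is downloaded until that call is made.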

I faced the problem as well when using the Google Cloud Platform for two projects involving deep learning. They provide servers with nothing but a freshly installed Ubuntu OS. Based on my experience, I recommend the following steps:

  • Look up the cuda and cuDNN versions supported by the current Tensorflow release on the Tensorflow page.
  • Install the targeted cuda version from the deb package retrieved from Nvidia's cuda page, and be careful: more recent cuda versions might not work! This will automatically install the corresponding Nvidia drivers.
  • Install the targeted cuDNN version from this page, and again be careful: a more recent cuDNN version might not work.
  • Install tensorflow-gpu using pip.

This should work. Your problem is probably that you are using a more recent cuda version than the one targeted by the current Tensorflow release.
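To see which cuda release is actually installed, one option is to parse the output of `nvcc --version`. This is only a sketch: the function names are illustrative, and it assumes nvcc is on PATH, which is not guaranteed after a deb install.

```python
import re
import subprocess


def parse_nvcc_release(text):
    """Extract 'major.minor' from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+\.\d+)", text)
    return match.group(1) if match else None


def installed_cuda_release():
    """Run nvcc and return its release string, or None if nvcc cannot be run."""
    try:
        result = subprocess.run(["nvcc", "--version"], stdout=subprocess.PIPE,
                                universal_newlines=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return parse_nvcc_release(result.stdout)
```

Comparing the returned string with the version listed on the Tensorflow page shows quickly whether the installed toolkit is newer than the targeted one.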

The guidelines provided on the official website for installing tensorflow-gpu are very tedious for beginners. Instead, we can follow these simple steps:

Note: the NVIDIA driver must be installed before this (you can verify this using the command nvidia-smi).

  1. Install Anaconda: https://www.anaconda.com/distribution/
  2. Create a virtual environment using the command "conda create -n envname"
  3. Then activate the environment using the command "conda activate envname"
  4. Finally, install tensorflow using the command "conda install tensorflow-gpu"

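You can then check whether the GPU is used with the given code:

import tensorflow as tf

# gpu_device_name() returns an empty string when no GPU is available
if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("not using gpu")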
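On TensorFlow 2.1 and later, a similar check can be written with `tf.config.list_physical_devices`, which lists devices without initializing them. The sketch below is an alternative to the snippet above, not part of the original answer; `summarize_gpus` is a hypothetical helper name.

```python
def summarize_gpus(devices):
    """Build a one-line summary from device objects that have a .name attribute."""
    if not devices:
        return "not using gpu"
    return "GPUs visible to TensorFlow: " + ", ".join(d.name for d in devices)


def main():
    try:
        import tensorflow as tf
    except ImportError as exc:
        # Covers both a missing package and missing CUDA shared libraries.
        print("import tensorflow failed: {}".format(exc))
        return
    print(summarize_gpus(tf.config.list_physical_devices("GPU")))


if __name__ == "__main__":
    main()
```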

You can find the tutorial at the link below: https://www.pugetsystems.com/labs/hpc/Install-TensorFlow-with-GPU-Support-the-Easy-Way-on-Ubuntu-18-04-without-installing-CUDA-1170/

I would suggest first checking the availability of the GPU using the nvidia-smi command.
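If you want to do that check from a script, a small sketch follows. It uses nvidia-smi's CSV query mode; the helper names are illustrative, and an empty result only means that nvidia-smi is missing or failed, which usually points at the driver.

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=name,driver_version", "--format=csv,noheader"]


def parse_gpu_lines(output):
    """Turn 'name, driver' CSV lines into (name, driver_version) tuples."""
    gpus = []
    for line in output.splitlines():
        if line.strip():
            name, _, driver = line.partition(",")
            gpus.append((name.strip(), driver.strip()))
    return gpus


def list_gpus():
    """Return the GPUs reported by nvidia-smi, or [] if the tool is missing or fails."""
    try:
        result = subprocess.run(QUERY, stdout=subprocess.PIPE,
                                universal_newlines=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_gpu_lines(result.stdout)
```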

I faced the same issue and was able to resolve it by using a docker container. You can install docker using Install Docker Engine on Ubuntu, or use the Digital Ocean guide (I used this one), How To Install and Use Docker on Ubuntu 18.04.

After that it is simple: just run one of the following commands, depending on your requirements:

NV_GPU='0' nvidia-docker run --runtime=nvidia -it -v /path/to/folder:/path/to/folder/for/docker/container nvcr.io/nvidia/tensorflow:17.11

NV_GPU='0' nvidia-docker run --runtime=nvidia -it -v /storage/research/:/storage/research/ nvcr.io/nvidia/tensorflow:20.12-tf2-py3

Here '0' is the GPU number; if you want to use more than one GPU, just use '0,1,2' and so on.

Hope this solves the issue.
