How to get the currently available GPUs in TensorFlow?

Question

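I plan to use distributed TensorFlow, and I have seen that TensorFlow can use GPUs for training and testing. In a cluster environment, each machine could have 0, 1, or more GPUs, and I want to run my TensorFlow graph on the GPUs of as many machines as possible.

I found that when running tf.Session(), TensorFlow gives information about the GPUs in log messages like the following: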

I tensorflow/core/common_runtime/gpu/gpu_init.cc:126] DMA: 0 
I tensorflow/core/common_runtime/gpu/gpu_init.cc:136] 0:   Y 
I tensorflow/core/common_runtime/gpu/gpu_device.cc:838] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1080, pci bus id: 0000:01:00.0)

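My question is: how do I get information about the currently available GPUs from TensorFlow? I can get the loaded GPU information from the log, but I want to do it in a more sophisticated, programmatic way. I could also restrict GPUs intentionally using the CUDA_VISIBLE_DEVICES environment variable, so I do not want a way of getting GPU information from the OS kernel.

In short, I want a function like tf.get_available_gpus() that returns ['/gpu:0', '/gpu:1'] if two GPUs are available in the machine. How can I implement this?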

Answer 1

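There is an undocumented method called device_lib.list_local_devices() that enables you to list the devices available in the local process. (Note: as an undocumented method, it is subject to backwards-incompatible changes.) The function returns a list of DeviceAttributes protocol buffer objects. You can extract a list of string device names for the GPU devices as follows: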

from tensorflow.python.client import device_lib

def get_available_gpus():
    local_device_protos = device_lib.list_local_devices()
    return [x.name for x in local_device_protos if x.device_type == 'GPU']

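Note that (at least up to TensorFlow 1.4), calling device_lib.list_local_devices() will run some initialization code that, by default, allocates all of the GPU memory on all of the devices (GitHub issue). To avoid this, first create a session with an explicitly small per_process_gpu_memory_fraction, or with allow_growth=True, to prevent all of the memory from being allocated. See this question for more details.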
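The memory caveat above can be sketched as follows. This is a minimal illustration, not the answer's own code: it assumes a TensorFlow version that exposes tf.compat.v1 and whose device_lib.list_local_devices() accepts a session_config argument, and the helper name gpu_names is made up for this example. On older 1.x releases the config class is tf.ConfigProto, and creating a session with that config first, as described above, is the fallback.

```python
def gpu_names(devices):
    """Return the names of all entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def get_available_gpus(allow_growth=True):
    """List local GPU names while asking TensorFlow not to grab all memory."""
    # Imported lazily so that gpu_names() stays usable without TensorFlow.
    import tensorflow as tf
    from tensorflow.python.client import device_lib

    config = tf.compat.v1.ConfigProto()
    config.gpu_options.allow_growth = allow_growth
    # Assumption: this TensorFlow version accepts session_config here.
    return gpu_names(device_lib.list_local_devices(session_config=config))
```

Keeping the filter in its own function makes it easy to check against stand-in objects on a machine that has neither TensorFlow nor a GPU.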

Answer 2

You can check the list of all devices with the following code:

from tensorflow.python.client import device_lib

device_lib.list_local_devices()

Answer 3

There is also a method in the test util. So all that has to be done is:

tf.test.is_gpu_available()

and/or

tf.test.gpu_device_name()

Look up the TensorFlow docs for the arguments.

Answer 4

In TensorFlow 2.0, you can use tf.config.experimental.list_physical_devices('GPU'):

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    print("Name:", gpu.name, "  Type:", gpu.device_type)

If you have two GPUs installed, it outputs this:

Name: /physical_device:GPU:0   Type: GPU
Name: /physical_device:GPU:1   Type: GPU

From 2.1, you can drop experimental:

gpus = tf.config.list_physical_devices('GPU')

See:

Guide page
Current API
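The question asked for names of the form '/gpu:0', while the physical device names shown above look like '/physical_device:GPU:0'. A minimal sketch of the conversion follows; the helper name to_legacy_gpu_names is hypothetical and is not part of TensorFlow.

```python
def to_legacy_gpu_names(physical_names):
    """Map '/physical_device:GPU:0' style names to '/gpu:0' style names."""
    legacy = []
    for name in physical_names:
        # The last two colon-separated fields are the device type and index.
        _, device_type, index = name.rsplit(":", 2)
        if device_type == "GPU":
            legacy.append("/gpu:" + index)
    return legacy
```

With TensorFlow installed, it could be called as to_legacy_gpu_names(gpu.name for gpu in tf.config.list_physical_devices('GPU')).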

Answer 5

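The accepted answer gives you the number of GPUs, but it also allocates all of the memory on those GPUs. You can avoid this by creating a session with a fixed, lower memory limit before calling device_lib.list_local_devices(), which may be unwanted for some applications.

I ended up using nvidia-smi to get the number of GPUs without allocating any memory on them.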

import subprocess

n = str(subprocess.check_output(["nvidia-smi", "-L"])).count('UUID')
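Counting the substring 'UUID' works, but parsing each line also recovers the index and model name. The sketch below assumes that nvidia-smi -L prints one line per GPU shaped like "GPU 0: <model> (UUID: GPU-...)"; the function names are made up for this example. Keep in mind that nvidia-smi lists every GPU the driver sees and does not take CUDA_VISIBLE_DEVICES into account, which the question explicitly cared about.

```python
import re
import subprocess

# Assumed line shape of `nvidia-smi -L`: "GPU 0: <model name> (UUID: GPU-...)"
_GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (.+)\)\s*$")


def parse_gpu_list(text):
    """Return (index, model name) pairs parsed from `nvidia-smi -L` output."""
    gpus = []
    for line in text.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2)))
    return gpus


def count_gpus():
    """Run `nvidia-smi -L` and count the GPUs it lists."""
    output = subprocess.check_output(["nvidia-smi", "-L"]).decode()
    return len(parse_gpu_list(output))
```

Separating the parser from the subprocess call lets the parsing be checked on a machine without an NVIDIA driver.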

Answer 6

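Apart from the excellent explanation by Mrry, who suggested using device_lib.list_local_devices(), I can show you how to check GPU-related information from the command line.

Because currently only Nvidia's GPUs work with NN frameworks, this answer covers only them. Nvidia has a page documenting how to use the /proc filesystem interface to obtain run-time information about the driver, any installed NVIDIA graphics cards, and the AGP status.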

/proc/driver/nvidia/gpus/0..N/information

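Provides information about each of the installed NVIDIA graphics adapters (model name, IRQ, BIOS version, bus type). Note that the BIOS version is only available while X is running.

So you can run cat /proc/driver/nvidia/gpus/0/information from the command line and see information about your first GPU. It is easy to run this from Python, and you can also check the second, third, and fourth GPU until it fails.

Mrry's answer is definitely more robust, and I am not sure whether my answer works on non-Linux machines, but Nvidia's page provides other interesting information that not many people know about.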
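Reading these files from Python might look like the sketch below. It assumes each information file consists of "Key: value" lines and that every GPU has its own subdirectory under /proc/driver/nvidia/gpus; the subdirectory names may differ between driver versions, so the code globs over them instead of counting up from 0. The function name is hypothetical.

```python
import glob
import os


def read_nvidia_proc_info(base="/proc/driver/nvidia/gpus"):
    """Return one dict of 'Key: value' fields per GPU directory under base."""
    gpus = []
    for path in sorted(glob.glob(os.path.join(base, "*", "information"))):
        fields = {}
        with open(path) as handle:
            for line in handle:
                # Split on the first colon only, so values may contain colons.
                key, sep, value = line.partition(":")
                if sep:
                    fields[key.strip()] = value.strip()
        gpus.append(fields)
    return gpus
```

On a machine without the NVIDIA driver the directory does not exist and the function returns an empty list.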

Answer 7

The following works in TensorFlow 2:

import tensorflow as tf
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    print("Name:", gpu.name, "  Type:", gpu.device_type)

From 2.1, you can drop experimental:

    gpus = tf.config.list_physical_devices('GPU')

https://www.tensorflow.org/api_docs/python/tf/config/list_physical_devices

Answer 8

I have a GPU called NVIDIA GTX GeForce 1650 Ti in my machine, with tensorflow-gpu==2.2.0.

Run the following two lines of code:

import tensorflow as tf
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

Output:

Num GPUs Available:  1

Answer 9

In TensorFlow Core v2.3.0, the following code should work.

import tensorflow as tf
visible_devices = tf.config.get_visible_devices()
for devices in visible_devices:
  print(devices)

Depending on your environment, this code will produce results like the following.

PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')
PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')

Answer 10

I am working with TF-2.1 and torch, so I do not want to tie this automatic selection to any ML framework. I just use plain nvidia-smi and os.environ to get a vacant GPU.

import os
import subprocess

def auto_gpu_selection(usage_max=0.01, mem_max=0.05):
    """Auto set CUDA_VISIBLE_DEVICES for gpu

    :param usage_max: max fraction of GPU utilization for a GPU to count as vacant
    :param mem_max: max fraction of GPU memory in use for a GPU to count as vacant
    :return:
    """
    os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
    # Drop the table header of the nvidia-smi output; each GPU then spans three lines.
    log = str(subprocess.check_output("nvidia-smi", shell=True)).split(r"\n")[6:-1]
    gpu = 0

    # Maximum of GPUS, 8 is enough for most
    for i in range(8):
        idx = i*3 + 2
        if idx > len(log)-1:
            break
        inf = log[idx].split("|")
        # inf[2] (memory) and inf[3] (utilization) are read below
        if len(inf) < 4:
            break
        usage = int(inf[3].split("%")[0].strip())
        mem_now = int(str(inf[2].split("/")[0]).strip()[:-3])
        mem_all = int(str(inf[2].split("/")[1]).strip()[:-3])
        # print("GPU-%d : Usage:[%d%%]" % (gpu, usage))
        if usage < 100*usage_max and mem_now < mem_max*mem_all:
            os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu)
            print("\nAuto choosing vacant GPU-%d : Memory:[%dMiB/%dMiB] , GPU-Util:[%d%%]\n" %
                  (gpu, mem_now, mem_all, usage))
            return
        print("GPU-%d is busy: Memory:[%dMiB/%dMiB] , GPU-Util:[%d%%]" %
              (gpu, mem_now, mem_all, usage))
        gpu += 1
    print("\nNo vacant GPU, use CPU instead\n")
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

If a GPU is free, it sets CUDA_VISIBLE_DEVICES to that GPU's bus ID:

GPU-0 is busy: Memory:[5738MiB/11019MiB] , GPU-Util:[60%]
GPU-1 is busy: Memory:[9688MiB/11019MiB] , GPU-Util:[78%]

Auto choosing vacant GPU-2 : Memory:[1MiB/11019MiB] , GPU-Util:[0%]

Otherwise, it is set to -1 to use the CPU:

GPU-0 is busy: Memory:[8900MiB/11019MiB] , GPU-Util:[95%]
GPU-1 is busy: Memory:[4674MiB/11019MiB] , GPU-Util:[35%]
GPU-2 is busy: Memory:[9784MiB/11016MiB] , GPU-Util:[74%]

No vacant GPU, use CPU instead

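Note: call this function before importing any ML framework that requires a GPU, and it will automatically choose a GPU. In addition, it makes it easy to set up multiple tasks.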
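Parsing the human-readable nvidia-smi table is fragile, because the number of header lines and the column layout can change between driver versions. A variant of the same idea, sketched here as an alternative and not as the answer's own method, asks nvidia-smi for CSV output through its --query-gpu and --format options. The function pick_vacant_gpu and the constant QUERY are names made up for this sketch, and it assumes every queried field comes back as a plain integer (a GPU reporting a field as unsupported would raise ValueError).

```python
import os
import subprocess

QUERY = ["nvidia-smi",
         "--query-gpu=index,utilization.gpu,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def pick_vacant_gpu(csv_text, usage_max=0.01, mem_max=0.05):
    """Return the index of the first idle GPU in csv_text, or None."""
    for line in csv_text.strip().splitlines():
        index, usage, mem_now, mem_all = (int(v) for v in line.split(","))
        if usage < 100 * usage_max and mem_now < mem_max * mem_all:
            return index
    return None


def auto_gpu_selection(usage_max=0.01, mem_max=0.05):
    """Point CUDA_VISIBLE_DEVICES at an idle GPU, or at -1 for CPU only."""
    os.environ["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    csv_text = subprocess.check_output(QUERY).decode()
    index = pick_vacant_gpu(csv_text, usage_max, mem_max)
    os.environ["CUDA_VISIBLE_DEVICES"] = "-1" if index is None else str(index)
    return index
```

As with the original function, auto_gpu_selection() has to run before the ML framework is imported for the environment variables to take effect.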

Answer 11

Use this approach to check all the components:

from __future__ import absolute_import, division, print_function, unicode_literals

import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds


version = tf.__version__
executing_eagerly = tf.executing_eagerly()
hub_version = hub.__version__
available = tf.config.experimental.list_physical_devices("GPU")

print("Version: ", version)
print("Eager mode: ", executing_eagerly)
print("Hub Version: ", hub_version)
print("GPU is", "available" if available else "NOT AVAILABLE")

Answer 12

Make sure you have the latest TensorFlow 2.x GPU build installed on your GPU-enabled machine, then execute the following code in Python:

from __future__ import absolute_import, division, print_function, unicode_literals

import tensorflow as tf 

print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))

You will get output that looks like:

2020-02-07 10:45:37.587838: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-02-07 10:45:37.588896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5, 6, 7
Num GPUs Available:  8

Answer 13

The way recommended by the latest version of TensorFlow:

tf.config.list_physical_devices('GPU')

Answer 14

Run the following in any shell:

python -c "import tensorflow as tf; print(\"Num GPUs Available: \", len(tf.config.list_physical_devices('GPU')))"

Answer 15

You can use the following code to show the device name, type, memory, and locality.

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

15 answers (score, date posted)

Answer 1: 291 (accepted), 2016-07-26 02:34:21
Answer 2: 151, 2017-07-19 06:52:44
Answer 3: 54, 2018-06-22 06:06:09
Answer 4: 46, 2019-06-03 01:35:55
Answer 5: 22, 2018-10-12 04:22:29
Answer 6: 9, 2017-07-29 04:31:12
Answer 7: 7, 2019-10-07 03:50:01
Answer 8: 3, 2020-05-30 10:57:00
Answer 9: 3, 2020-11-19 07:58:03
Answer 10: 1, 2020-08-02 07:59:49
Answer 11: 0, 2020-01-16 09:16:48
Answer 12: 0, 2020-02-07 12:47:38
Answer 13: 0, 2021-12-14 10:47:46
Answer 14: 0, 2022-04-03 20:48:48
Answer 15: 0, 2023-01-13 06:50:12