How to check whether TensorFlow is using all available GPUs

Question

I am learning to use TensorFlow for object detection. To speed up training, I am using an AWS g3.16xlarge instance, which has 4 GPUs. I start training with the following commands:

export CUDA_VISIBLE_DEVICES=0,1,2,3
python object_detection/train.py --logtostderr --pipeline_config_path=/home/ubuntu/builder/rcnn.config --train_dir=/home/ubuntu/builder/experiments/training/

In rcnn.config I have set batch-size = 1. When training runs, I get the following output:

Console output

2018-11-09 07:25:50.104310: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1045] Device peer to peer matrix
2018-11-09 07:25:50.104385: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1051] DMA: 0 1 2 3 
2018-11-09 07:25:50.104395: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 0:   Y N N N 
2018-11-09 07:25:50.104402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 1:   N Y N N 
2018-11-09 07:25:50.104409: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 2:   N N Y N 
2018-11-09 07:25:50.104416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1061] 3:   N N N Y 
2018-11-09 07:25:50.104429: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla M60, pci bus id: 0000:00:1b.0, compute capability: 5.2)
2018-11-09 07:25:50.104439: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla M60, pci bus id: 0000:00:1c.0, compute capability: 5.2)
2018-11-09 07:25:50.104446: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:2) -> (device: 2, name: Tesla M60, pci bus id: 0000:00:1d.0, compute capability: 5.2)
2018-11-09 07:25:50.104455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:3) -> (device: 3, name: Tesla M60, pci bus id: 0000:00:1e.0, compute capability: 5.2)

When I run nvidia-smi, I get the following output:

nvidia-smi output

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla M60           Off  | 0000:00:1B.0     Off |                    0 |
| N/A   52C    P0   129W / 150W |   7382MiB /  7612MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla M60           Off  | 0000:00:1C.0     Off |                    0 |
| N/A   33C    P0    38W / 150W |   7237MiB /  7612MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla M60           Off  | 0000:00:1D.0     Off |                    0 |
| N/A   40C    P0    38W / 150W |   7237MiB /  7612MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla M60           Off  | 0000:00:1E.0     Off |                    0 |
| N/A   34C    P0    39W / 150W |   7237MiB /  7612MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     97860    C   python                                        7378MiB |
|    1     97860    C   python                                        7233MiB |
|    2     97860    C   python                                        7233MiB |
|    3     97860    C   python                                        7233MiB |
+-----------------------------------------------------------------------------+

And nvidia-smi dmon gives the following output:

# gpu   pwr  temp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     %     %     %     %   MHz   MHz
    0   158    69    90    69     0     0  2505  1177
    1    38    36     0     0     0     0  2505   556
    2    38    45     0     0     0     0  2505   556
    3    39    37     0     0     0     0  2505   556

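I am confused by these outputs. According to the console output, the program recognizes that 4 different GPUs are available. In the nvidia-smi output, however, the Volatile GPU-Util percentage is non-zero only for the first GPU and is 0% for the other three. Yet the process table at the bottom of the same output shows memory in use on all 4 GPUs. Likewise, nvidia-smi dmon reports a non-zero sm value only for the first GPU and zeros for the others. From this blog I understood that a zero in dmon means the GPU is idle.

What I want to understand is whether train.py is using all 4 GPUs in my instance. If it is not, how can I make sure that TensorFlow's object_detection/train.py is optimized to use all of the GPUs?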

Answer 1

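First check whether TensorFlow can find a GPU device. Note that this call returns a single device name, not a list of all GPUs.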

tf.test.gpu_device_name()

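It returns the name of a GPU device if one is available, or an empty string otherwise.

Then you can do something like this to spread work across the available GPUs. The example places ops on GPU 2 and GPU 3; extend the device list to cover every GPU you want to use.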
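Because tf.test.gpu_device_name() reports only one device, it can help to list every GPU that TensorFlow has registered. The following is a minimal sketch, assuming a TensorFlow 1.x install where tensorflow.python.client.device_lib is importable; the helper names gpu_device_names and list_tensorflow_gpus are made up for this illustration.

```python
def gpu_device_names(devices):
    """Return the names of all entries whose device_type is 'GPU'."""
    return [d.name for d in devices if d.device_type == "GPU"]


def list_tensorflow_gpus():
    """List the GPU devices TensorFlow has registered in this process."""
    # Imported lazily so gpu_device_names() stays usable without TensorFlow.
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())
```

On the instance from the question, calling list_tensorflow_gpus() would be expected to return four names, one per Tesla M60. This only shows which GPUs are visible; it does not show that any work is running on them. On newer TensorFlow 2.x releases, tf.config.list_physical_devices('GPU') serves the same purpose.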

import tensorflow as tf  # TensorFlow 1.x API (tf.Session, tf.ConfigProto)

# Creates a graph with one matmul per listed GPU.
c = []
for d in ['/device:GPU:2', '/device:GPU:3']:
  with tf.device(d):
    a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
    b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
    c.append(tf.matmul(a, b))
# Adds up the per-GPU results on the CPU.
with tf.device('/cpu:0'):
  total = tf.add_n(c)  # named "total" to avoid shadowing the built-in sum()
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(total))

You will see output like the following:

Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: Tesla K20m, pci bus id: 0000:02:00.0
/job:localhost/replica:0/task:0/device:GPU:1 -> device: 1, name: Tesla K20m, pci bus id: 0000:03:00.0
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: Tesla K20m, pci bus id: 0000:83:00.0
/job:localhost/replica:0/task:0/device:GPU:3 -> device: 3, name: Tesla K20m, pci bus id: 0000:84:00.0
Const_3: /job:localhost/replica:0/task:0/device:GPU:3
Const_2: /job:localhost/replica:0/task:0/device:GPU:3
MatMul_1: /job:localhost/replica:0/task:0/device:GPU:3
Const_1: /job:localhost/replica:0/task:0/device:GPU:2
Const: /job:localhost/replica:0/task:0/device:GPU:2
MatMul: /job:localhost/replica:0/task:0/device:GPU:2
AddN: /job:localhost/replica:0/task:0/cpu:0
[[  44.   56.]
 [  98.  128.]]
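For a larger graph the placement log gets long, so counting ops per device is easier than reading it line by line. The sketch below assumes the log lines look like the ones shown above ("OpName: /job:.../device"); it also tolerates a variant with the op type in parentheses, which is an assumption about other TensorFlow versions. The function name ops_per_device is hypothetical.

```python
import re
from collections import Counter

# Matches "OpName: /job:.../device" and, as an assumed variant,
# "OpName: (OpType): /job:.../device".
PLACEMENT = re.compile(r"^(\S+): (?:\(\S+\): )?(/job:\S+)$")


def ops_per_device(log_text):
    """Count how many ops the placement log assigns to each device."""
    counts = Counter()
    for line in log_text.splitlines():
        match = PLACEMENT.match(line.strip())
        if match:
            # Keep only the last path component, e.g. "device:GPU:3".
            counts[match.group(2).rsplit("/", 1)[-1]] += 1
    return counts


# Lines copied from the output above; the device-mapping line is ignored.
sample_log = """\
/job:localhost/replica:0/task:0/device:GPU:2 -> device: 2, name: Tesla K20m
Const_3: /job:localhost/replica:0/task:0/device:GPU:3
Const_2: /job:localhost/replica:0/task:0/device:GPU:3
MatMul_1: /job:localhost/replica:0/task:0/device:GPU:3
Const_1: /job:localhost/replica:0/task:0/device:GPU:2
Const: /job:localhost/replica:0/task:0/device:GPU:2
MatMul: /job:localhost/replica:0/task:0/device:GPU:2
AddN: /job:localhost/replica:0/task:0/cpu:0
"""

for device, count in ops_per_device(sample_log).items():
    print(device, count)
```

A GPU that never appears in the counts received no ops, which would be consistent with the 0% utilization the question reports for GPUs 1 to 3. TensorFlow writes the placement log to stderr, so redirect stderr to a file before feeding it to the function.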

Answer 2

Python code to check whether a GPU has been found and is available to TensorFlow:

## Libraries import
import tensorflow as tf

## Test GPU
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
print('')
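# Allocate GPU memory on demand instead of reserving nearly all of it up front.
# Pass this config when creating the session: tf.Session(config=config)
config = tf.ConfigProto()
config.gpu_options.allow_growth = True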
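Finding a GPU does not show that every GPU is busy. To cross-check against nvidia-smi as the question does, the utilization figures can be read programmatically. This is a rough sketch, assuming nvidia-smi is on the PATH and reports numeric values for the queried fields; the 5% idle threshold and all function names are arbitrary choices for illustration.

```python
import subprocess

# Fields passed to nvidia-smi's --query-gpu option.
QUERY = "index,utilization.gpu,memory.used"


def read_gpu_stats():
    """Run nvidia-smi and return its CSV output (needs an NVIDIA driver)."""
    cmd = ["nvidia-smi", "--query-gpu=" + QUERY, "--format=csv,noheader,nounits"]
    return subprocess.check_output(cmd, universal_newlines=True)


def parse_gpu_stats(csv_text):
    """Turn 'index, util %, memory MiB' lines into a list of dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        stats.append({"index": int(index), "util": int(util), "mem_mib": int(mem)})
    return stats


def idle_gpus(stats, util_threshold=5):
    """Return indices of GPUs whose utilization is below the threshold."""
    return [s["index"] for s in stats if s["util"] < util_threshold]


# Sample values copied from the nvidia-smi table in the question.
sample = "0, 92, 7382\n1, 0, 7237\n2, 0, 7237\n3, 0, 7237\n"
print(idle_gpus(parse_gpu_stats(sample)))  # prints [1, 2, 3]
```

On a live machine, replace sample with read_gpu_stats(). With the question's numbers, GPUs 1 to 3 hold memory but do no work. That pattern is what TensorFlow 1.x produces by default: it reserves memory on every visible GPU, while ops run only on the devices they are placed on.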
