
Tensorflow: Memory growth cannot differ between GPU devices | How to use multi-GPU with tensorflow

I am trying to run Keras code on a GPU node inside a cluster. Each GPU node has 4 GPUs, and I have confirmed that all 4 GPUs in the node are available to me. I run the code below to make tensorflow use the GPUs:

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
            logical_gpus = tf.config.list_logical_devices('GPU')
            print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        print(e)
        

The output lists the 4 available GPUs. However, I get the following error when running the code:

Traceback (most recent call last):
  File "/BayesOptimization.py", line 20, in <module>
    logical_gpus = tf.config.experimental.list_logical_devices('GPU')
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/framework/config.py", line 439, in list_logical_devices
    return context.context().list_logical_devices(device_type=device_type)
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1368, in list_logical_devices
    self.ensure_initialized()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 511, in ensure_initialized
    config_str = self.config.SerializeToString()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1015, in config
    gpu_options = self._compute_gpu_options()
  File "/.conda/envs/thesis/lib/python3.9/site-packages/tensorflow/python/eager/context.py", line 1074, in _compute_gpu_options
    raise ValueError("Memory growth cannot differ between GPU devices")
ValueError: Memory growth cannot differ between GPU devices

Shouldn't this code list all available GPUs and set memory growth to true for each of them?

I am currently using the tensorflow library with Python 3.9.7:

tensorflow                2.4.1           gpu_py39h8236f22_0
tensorflow-base           2.4.1           gpu_py39h29c2da4_0
tensorflow-estimator      2.4.1              pyheb71bc4_0
tensorflow-gpu            2.4.1                h30adc30_0

Any idea what the problem is and how to solve it? Thanks in advance!

Just try os.environ["CUDA_VISIBLE_DEVICES"] = "0" instead of tf.config.experimental.set_memory_growth. That worked for me.
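The order of operations matters with this workaround: CUDA_VISIBLE_DEVICES must be set before TensorFlow is imported, because the runtime enumerates the visible devices when it initializes. A minimal sketch:

```python
import os

# Hide all but the first GPU. This must happen BEFORE importing
# tensorflow; once the runtime has initialized, changing the
# variable has no effect on which devices it sees.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# import tensorflow as tf   # import only after the variable is set
# tf.config.list_physical_devices('GPU')  # would now show a single GPU
```

Note that this sidesteps the error rather than fixing it: with a single visible GPU there is only one device, so memory growth can no longer "differ between GPU devices", but you also lose the other three GPUs.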

With multiple GPU devices, memory growth must be the same across all available GPUs: either set it to true for all of them, or leave it false for all of them. In your snippet, tf.config.list_logical_devices('GPU') is called inside the for loop, so the runtime is initialized after memory growth has been set on only the first GPU; move that call outside the loop:

gpus = tf.config.list_physical_devices('GPU')
if gpus:
  try:
    # Currently, memory growth needs to be the same across GPUs
    for gpu in gpus:
      tf.config.experimental.set_memory_growth(gpu, True)
    logical_gpus = tf.config.list_logical_devices('GPU')
    print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
  except RuntimeError as e:
    # Memory growth must be set before GPUs have been initialized
    print(e)
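Once all four GPUs are configured consistently, the usual way to actually use them from Keras is a distribution strategy. A minimal sketch, assuming a standard TensorFlow 2.x install (on a machine with no GPUs, MirroredStrategy falls back to a single CPU replica):

```python
import tensorflow as tf

# MirroredStrategy replicates the model onto every visible GPU and
# splits each batch across the replicas.
strategy = tf.distribute.MirroredStrategy()
print("Number of replicas:", strategy.num_replicas_in_sync)

# Variables (layers, optimizer state) must be created inside the scope
# so they are mirrored onto every device.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
    model.compile(optimizer="sgd", loss="mse")

# model.fit(...) then trains synchronously across all replicas.
```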

TensorFlow GPU documentation
