IBM Spectrum LSF - Accessing multiple GPUs on different HPC nodes

Question

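I am trying to use multiple GPUs: 8 GPUs in total, with 4 GPU devices per node, so 2 nodes in total.

So far, I am getting an "out of memory" error.

I checked the output of the relevant part of my TensorFlow code, and only 4 GPU devices are being used.

My TensorFlow code is a modified tutorial that calls TensorFlow functions on large input files (it ran fine with smaller files in an interactive HPC session with 2 GPUs). The TensorFlow code automatically finds the GPUs and distributes the work across them.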
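A single Python process can only reach the GPUs installed in the host it runs on, which would be consistent with seeing 4 devices rather than 8. The sketch below is one way to check this from inside the job using only the standard library. It assumes the scheduler exports `CUDA_VISIBLE_DEVICES` for the job (the variable is read by the CUDA runtime; whether it is set depends on the site configuration), and the helper name `visible_gpu_ids` is made up for illustration.

```python
import os
import socket


def visible_gpu_ids(env=None):
    """Return the GPU ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        # Unset: the CUDA runtime does not restrict the GPUs of this host.
        return None
    # The variable holds a comma-separated list such as "0,1,2,3".
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


if __name__ == "__main__":
    # Only this host's GPUs can ever appear here, never those of another node.
    print("host:", socket.gethostname())
    print("visible GPUs:", visible_gpu_ids())
```

Running this at the top of `gpu_job.py` shows which node the program landed on and which local devices it was given.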

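How can I get my job script or Python program to find and use all 8 GPUs (from 2 nodes)?

The HPC staff could not help me with this and mentioned that it would require complex code. I have been looking for a good tutorial for the past two days but could not find one.

Any helpful suggestions are welcome. Here is my current script: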

#!/bin/bash
#BSUB -q gpu
#BSUB -J gpus_8
#BSUB -P acc_hpc
#BSUB -R v100
#BSUB -n 2
#BSUB -R "affinity[core(30)]"
#BSUB -R rusage[mem=326000,ngpus_excl_p=4]
#BSUB -W 05:00
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash

WRKDIR=/scratch/user
ml anaconda3
source activate environ1

python3 gpu_job.py

Answer 1

Use #BSUB -R rusage[mem=326000,ngpus_excl_p=8] instead. Resource requirements are usually per job. See also https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_resource_sharing/use_gpu_res_reqs.html .
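Independently of the resource request, using GPUs on two nodes from TensorFlow generally means running one worker process per node and telling each worker about the others. The sketch below is not part of the answer above; it shows one possible way to derive a `TF_CONFIG` value (the JSON that `tf.distribute.MultiWorkerMirroredStrategy` reads) from the LSF variable `LSB_MCPU_HOSTS`, which lists allocated hosts and slot counts as `hostA 4 hostB 4`. The function names, the port `12345`, and the host names `node01`/`node02` are assumptions chosen for illustration.

```python
import json


def parse_lsb_mcpu_hosts(value):
    """Parse 'hostA 4 hostB 4' into [('hostA', 4), ('hostB', 4)]."""
    tokens = value.split()
    if len(tokens) % 2 != 0:
        raise ValueError("expected alternating host names and slot counts")
    return [(tokens[i], int(tokens[i + 1])) for i in range(0, len(tokens), 2)]


def build_tf_config(hosts, this_host, port=12345):
    """Build a TF_CONFIG dict with one worker per host (port is an assumption)."""
    workers = [f"{host}:{port}" for host in hosts]
    return {
        "cluster": {"worker": workers},
        # Each process identifies itself by its position in the worker list.
        "task": {"type": "worker", "index": hosts.index(this_host)},
    }


if __name__ == "__main__":
    # Hypothetical allocation: two nodes with four slots each.
    allocation = parse_lsb_mcpu_hosts("node01 4 node02 4")
    hosts = [host for host, _ in allocation]
    print(json.dumps(build_tf_config(hosts, "node02")))
```

In a real job the string would come from `os.environ["LSB_MCPU_HOSTS"]`, the result would be stored in `os.environ["TF_CONFIG"]` before creating the strategy, and the program would have to be started once on each allocated host (for example with LSF's `blaunch`). Host names reported by the node itself may differ in form (short versus fully qualified) from those in the variable, so the lookup may need normalising.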

Solution 1
0 2020-01-27 19:09:41