
IBM Spectrum LSF - Accessing multiple GPUs on different HPC nodes

I'm trying to use multiple GPUs: 8 GPUs in total, with 4 GPU devices per node across 2 nodes.

So far I am getting a "not enough memory" error.

I checked the partial output of my TensorFlow code, and only 4 GPU devices are being used.

My TensorFlow code is a tutorial that I modified to use TensorFlow functions with a large input file (it works well in an HPC interactive environment with 2 GPUs and a smaller file). The TensorFlow code automatically finds the GPUs and spreads the work across them.
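The GPU-spreading part looks like the standard single-host pattern; a minimal sketch of that pattern (assuming tf.distribute.MirroredStrategy, which only ever sees the GPUs on one machine, and a placeholder model):

import tensorflow as tf

# MirroredStrategy replicates the model across every GPU visible on this host.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)  # 4 on one of these nodes

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then shards each batch across the local GPUs only.

If the tutorial uses something like this, it would explain why only 4 devices show up: the strategy can only see the GPUs on the node the process runs on.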

How do I get my job script or Python program to find and use all 8 GPUs (from the 2 nodes)?

The HPC staff can't help me with this and mentioned that complex code is needed. I've spent the last two days looking for a good tutorial and couldn't find one.
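The closest thing I have found in the TensorFlow documentation is tf.distribute.MultiWorkerMirroredStrategy, which needs one process per node and a TF_CONFIG environment variable describing the cluster. A rough sketch (the host names and port are placeholders, and I have not gotten this working under LSF):

import json
import os
import tensorflow as tf

# One copy of this script runs on each node; only the task index differs.
# "node1"/"node2" and port 12345 are placeholders for the real host names.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node1:12345", "node2:12345"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second node
})

# Spans all GPUs on every worker listed in TF_CONFIG (4 + 4 = 8 here).
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

Presumably the two workers would also have to be launched on their respective hosts from the job script (LSF's blaunch can start a task on each allocated host), which may be the "complex code" the staff meant.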

Any helpful suggestions are welcome. Here is my current script:

#!/bin/bash
# Queue, job name, and project/account
#BSUB -q gpu
#BSUB -J gpus_8
#BSUB -P acc_hpc
# Request V100 nodes: 2 tasks (job slots), 30 cores per task
#BSUB -R v100
#BSUB -n 2
#BSUB -R "affinity[core(30)]"
# Reserve memory plus 4 GPUs in exclusive-process mode
#BSUB -R rusage[mem=326000,ngpus_excl_p=4]
# 5-hour wall-clock limit, output/error files, login shell
#BSUB -W 05:00
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash

WRKDIR=/scratch/user
ml anaconda3              # load the Anaconda module
source activate environ1  # activate the conda environment

python3 gpu_job.py

Use #BSUB -R rusage[mem=326000,ngpus_excl_p=8] instead. Resource requirements are normally per job. See also https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_resource_sharing/use_gpu_res_reqs.html.
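To confirm what the job actually gets, a quick check from inside gpu_job.py (note that a single Python process only reports the GPUs on its own host, so spanning both nodes still requires a multi-worker setup like the one sketched above):

import tensorflow as tf

# Lists the GPU devices visible to this process on the local host.
gpus = tf.config.list_physical_devices("GPU")
print(len(gpus), "GPUs visible:", gpus)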
