
IBM Spectrum LSF - Accessing multiple GPUs on different HPC nodes

I'm trying to use multiple GPUs: 8 GPUs in total, with 4 GPU devices per node across 2 nodes.

So far I am getting a "not enough memory" error.

I checked the partial output of my TensorFlow code, and only 4 GPU devices are being used.

My TensorFlow code is a tutorial that I modified to use TensorFlow functions with a large input file (it works well in an HPC interactive environment with 2 GPUs and a smaller file). The TensorFlow code automatically finds the GPUs and spreads the work across them.
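The GPU-spreading part looks like the standard single-host pattern; a minimal sketch of that pattern (assuming tf.distribute.MirroredStrategy, which only ever sees the GPUs on one machine, and a placeholder model):

import tensorflow as tf

# MirroredStrategy replicates the model across every GPU visible on this host.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)  # 4 on one of these nodes

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then shards each batch across the local GPUs only.

If the tutorial uses something like this, it would explain why only 4 devices show up: the strategy can only see the GPUs on the node the process runs on.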

How do I get my job script or Python program to find and use all 8 GPUs (from the 2 nodes)?

The HPC staff can't help me with this and mentioned that complex code is needed. I've spent the last two days looking for a good tutorial and couldn't find one.
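The closest thing I have found in the TensorFlow documentation is tf.distribute.MultiWorkerMirroredStrategy, which needs one process per node and a TF_CONFIG environment variable describing the cluster. A rough sketch (the host names and port are placeholders, and I have not gotten this working under LSF):

import json
import os
import tensorflow as tf

# One copy of this script runs on each node; only the task index differs.
# "node1"/"node2" and port 12345 are placeholders for the real host names.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["node1:12345", "node2:12345"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second node
})

# Spans all GPUs on every worker listed in TF_CONFIG (4 + 4 = 8 here).
strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")

Presumably the two workers would also have to be launched on their respective hosts from the job script (LSF's blaunch can start a task on each allocated host), which may be the "complex code" the staff meant.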

Any helpful suggestions are welcome. Here is my current script:

#!/bin/bash
# Queue, job name, and project/account
#BSUB -q gpu
#BSUB -J gpus_8
#BSUB -P acc_hpc
# Request V100 nodes: 2 tasks (job slots), 30 cores per task
#BSUB -R v100
#BSUB -n 2
#BSUB -R "affinity[core(30)]"
# Reserve memory plus 4 GPUs in exclusive-process mode
#BSUB -R rusage[mem=326000,ngpus_excl_p=4]
# 5-hour wall-clock limit, output/error files, login shell
#BSUB -W 05:00
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash

WRKDIR=/scratch/user
ml anaconda3              # load the Anaconda module
source activate environ1  # activate the conda environment

python3 gpu_job.py

Use #BSUB -R rusage[mem=326000,ngpus_excl_p=8] instead. Resource requirements are normally per job. See also https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_resource_sharing/use_gpu_res_reqs.html.
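To confirm what the job actually gets, a quick check from inside gpu_job.py (note that a single Python process only reports the GPUs on its own host, so spanning both nodes still requires a multi-worker setup like the one sketched above):

import tensorflow as tf

# Lists the GPU devices visible to this process on the local host.
gpus = tf.config.list_physical_devices("GPU")
print(len(gpus), "GPUs visible:", gpus)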
