I'm trying to use multiple GPUs: 8 in total, with 4 GPU devices per node across 2 nodes.
So far I'm getting a "not enough memory" error.
Partial output from my TensorFlow code shows that only 4 GPU devices are being utilized.
My TensorFlow code is a modified tutorial that runs TensorFlow functions over a large input file (it works well with a smaller file in an HPC interactive environment with 2 GPUs). The code automatically finds the GPUs and spreads the work across them.
How do I get my job script or Python program to find and use all 8 GPUs (across the 2 nodes)?
The HPC staff can't help me with this and mentioned that complex code is needed. I've spent the last two days looking for a good tutorial and couldn't find one.
Any helpful suggestions are welcome. Here is my current script:
#!/bin/bash
#BSUB -q gpu
#BSUB -J gpus_8
#BSUB -P acc_hpc
#BSUB -R v100
#BSUB -n 2
#BSUB -R "affinity[core(30)]"
#BSUB -R rusage[mem=326000,ngpus_excl_p=4]
#BSUB -W 05:00
#BSUB -o %J.stdout
#BSUB -eo %J.stderr
#BSUB -L /bin/bash
WRKDIR=/scratch/user
ml anaconda3
source activate environ1
python3 gpu_job.py
Use #BSUB -R rusage[mem=326000,ngpus_excl_p=8]
instead. Resource requirements in rusage[] are normally per job, not per node, so ngpus_excl_p=4 requests only 4 GPUs for the whole job. See also https://www.ibm.com/support/knowledgecenter/en/SSWRJV_10.1.0/lsf_resource_sharing/use_gpu_res_reqs.html .
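Note that fixing the resource request only reserves the GPUs; TensorFlow still sees one node at a time unless the program uses multi-worker distribution (typically tf.distribute.MultiWorkerMirroredStrategy, which reads a TF_CONFIG environment variable describing the cluster). A minimal sketch, assuming hypothetical node names and one worker process per node, of how TF_CONFIG could be assembled from the LSF host list:

```python
import json
import os


def build_tf_config(hosts, my_index, port=2222):
    """Return the TF_CONFIG JSON string that TensorFlow's
    MultiWorkerMirroredStrategy reads to discover the cluster:
    a 'cluster' spec listing every worker and this process's 'task' index."""
    cluster = {"worker": ["{}:{}".format(h, port) for h in hosts]}
    return json.dumps({"cluster": cluster,
                       "task": {"type": "worker", "index": my_index}})


if __name__ == "__main__":
    # Hypothetical host names; under LSF the allocated hosts are listed
    # in $LSB_HOSTS (one entry per slot, so deduplicate them).
    hosts = sorted(set(os.environ.get("LSB_HOSTS", "node01 node02").split()))
    # Each worker needs its own index; here it is derived from which
    # host this process is running on.
    me = os.environ.get("HOSTNAME", hosts[0])
    my_index = hosts.index(me) if me in hosts else 0
    os.environ["TF_CONFIG"] = build_tf_config(hosts, my_index)
    print(os.environ["TF_CONFIG"])
```

With TF_CONFIG set on each node, the training script would create the strategy (strategy = tf.distribute.MultiWorkerMirroredStrategy()) and build the model inside strategy.scope(). The job script would also need to start one copy of the program on each node (LSF provides blaunch for this) rather than a single python3 invocation; the exact launch mechanism depends on the cluster setup, so check with the HPC staff.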