I am trying to allocate 2 GPUs and run one Python script across both of them. The script requires two environment variables: $AMBERHOME, which is set by sourcing the amber.sh script, and $CUDA_VISIBLE_DEVICES, which should be something like 0,1 for the two GPUs I have requested.
Currently, I have been experimenting with this basic script.
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=slurm_info
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --time=5:00:00
#SBATCH --partition=gpu-v100
## Prepare Run
source /usr/local/amber20/amber.sh
export CUDA_VISIBLE_DEVICES=0,1
## Perform Run
python calculations.py
When I run the script, I can see that 2 GPUs are requested.
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
11111 GPU test jsmith CF 0:02 2 gpu-[1-2]
When I look at the output ('slurm_info') I see,
cpu-bind=MASK - gpu-1, task 0 0 [10111]: mask 0x1 set
and of course information about the failed job.
On my local workstation, which has 2 GPUs, running nvidia-smi on the command line shows:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:00:1E.0 Off | 0 |
| N/A 29C P0 24W / 300W | 0MiB / 16160MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 GeForce RTX 208... On | 00000000:00:1E.0 Off | 0 |
| N/A 29C P0 24W / 300W | 0MiB / 16160MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
However, when I run nvidia-smi from the batch script on the cluster, I see the following.
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.91.03 Driver Version: 460.91.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:00:1E.0 Off | 0 |
| N/A 29C P0 24W / 300W | 0MiB / 16160MiB | 0% E. Process |
| | | N/A |
+-------------------------------+----------------------+----------------------+
This makes me think that the Python script only sees one GPU when it runs.
You are requesting two nodes, not two GPUs. The exact syntax for requesting GPUs depends on the Slurm version and on how your cluster is configured, but you can generally use #SBATCH -G 2
to request two GPUs.
Slurm usually also sets CUDA_VISIBLE_DEVICES
for you, so there is no need to export it yourself. Try this:
#!/bin/bash
#
#SBATCH --job-name=test
#SBATCH --output=slurm_info
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2 # adjust to your workload
#SBATCH -G 2
#SBATCH --time=5:00:00
#SBATCH --partition=gpu-v100
## Prepare Run
source /usr/local/amber20/amber.sh
## Perform Run
python calculations.py
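As a sanity check, you can have the job report which devices it actually sees before doing any work. A minimal sketch, which could go at the top of calculations.py (the helper name `visible_gpu_ids` is my own, not part of Slurm or Amber):

```python
import os

def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device-id strings."""
    if env is None:
        env = os.environ
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    return [d.strip() for d in raw.split(",") if d.strip()]

# With two GPUs allocated, Slurm typically sets CUDA_VISIBLE_DEVICES=0,1:
print(visible_gpu_ids({"CUDA_VISIBLE_DEVICES": "0,1"}))  # ['0', '1']
# With the variable unset, no devices are visible:
print(visible_gpu_ids({}))  # []
```

If this prints only one device id inside the job, the allocation (not the Python script) is the problem.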