
Get the number of free GPUs on a SLURM Cluster

I am scheduling jobs on a cluster that take up either 1 or 2 GPUs on some nodes. I frequently use sinfo -p gpu to list all nodes of the 'gpu' partition along with their state. Some appear with the state 'idle', indicating that no job is running on them. Others, however, appear with the state 'mix', meaning that some job is running on them.

However, there is no information on how many GPUs of a mixed-state node are actually taken. Is there any command, possibly sinfo-based, that tells me the number of free GPUs, ideally per node?

The sinfo manual did not give any insight except the output option "%G", which only shows the number of GPUs configured in general. Thanks!
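For illustration, %G can be combined with a node-name field, e.g. something like

sinfo -p gpu --format="%10n %G"

which prints the configured generic resources (e.g. gpu:4) per node, but says nothing about how many of those GPUs are currently allocated.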

Update: I realized that I can use "%C" to print the allocated/idle CPU counts per node with the following command:

sinfo --format="%9P %l %10n %.14C %.10T"

I want to do the exact same thing but with GPUs instead of CPUs.

Unfortunately, sinfo does not provide that information directly. You will have to parse the output of scontrol:

scontrol -o show node | grep  -Po "AllocTRES[^ ]*(?<=gpu=)\K[0-9]+" | paste -d + -s | bc

This lists all nodes, extracts the part that corresponds to AllocTRES (allocated trackable resources, of which GPUs are a part) and, within that, the value that concerns the GPUs. It then uses paste and bc to compute the sum (you could use awk instead if you prefer; see the sketch below).
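For example, an awk-only replacement for the paste/bc part (same grep as above, this is just one possible way to write the sum) could look like:

scontrol -o show node | grep -Po "AllocTRES[^ ]*(?<=gpu=)\K[0-9]+" | awk '{sum += $1} END {print sum + 0}'

The "+ 0" simply forces a 0 to be printed when no node has any GPU allocated.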

If you replace Alloc with Cfg in the one-liner, you will have the total number of GPUs configured.
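If you want the numbers per node rather than cluster-wide, here is a rough sketch along the same lines. It assumes that both CfgTRES and AllocTRES report GPUs as a gpu=<count> entry (as matched by the grep above) and treats a missing entry in AllocTRES as 0 allocated GPUs:

scontrol -o show node | awk '{
    name = ""; cfg = 0; alloc = 0
    for (i = 1; i <= NF; i++) {
        # NodeName=node001 -> node001
        if ($i ~ /^NodeName=/) { split($i, a, "="); name = a[2] }
        # CfgTRES=...,gres/gpu=4 -> 4 (configured GPUs)
        if ($i ~ /^CfgTRES=/ && match($i, /gpu=[0-9]+/)) { cfg = substr($i, RSTART + 4, RLENGTH - 4) }
        # AllocTRES=...,gres/gpu=2 -> 2 (allocated GPUs)
        if ($i ~ /^AllocTRES=/ && match($i, /gpu=[0-9]+/)) { alloc = substr($i, RSTART + 4, RLENGTH - 4) }
    }
    # only print nodes that actually have GPUs configured
    if (cfg + 0 > 0) print name, cfg - alloc
}'

This prints, for each GPU node, the node name followed by the number of currently free GPUs.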
