How to monitor resources during a Slurm job?

I'm running jobs on our university cluster (regular user, no admin rights), which uses the SLURM scheduling system, and I'm interested in plotting the CPU and memory usage over time, i.e. while the job is running. I know about sacct and sstat, and I was thinking of including these commands in my submission script, e.g. something along the lines of:

#!/bin/bash
#SBATCH <options>

# Running the actual job in background
srun my_program input.in output.out &

# While loop that records resources
JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
FIRST=0
#sleep time in seconds
STIME=15
while [ "$JobStatus" != "COMPLETED" ]; do
    #update job status
    JobStatus="$(sacct -j $SLURM_JOB_ID | awk 'FNR == 3 {print $6}')"
    if [ "$JobStatus" == "RUNNING" ]; then
        if [ $FIRST -eq 0 ]; then
            sstat --format=AveCPU,AveRSS,MaxRSS -P -j ${SLURM_JOB_ID} >> usage.txt
            FIRST=1
        else
            sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j ${SLURM_JOB_ID} >> usage.txt
        fi
        sleep $STIME
    elif [ "$JobStatus" == "PENDING" ]; then
        sleep $STIME
    else
        sacct -j ${SLURM_JOB_ID} --format=AllocCPUS,ReqMem,MaxRSS,AveRSS,AveDiskRead,AveDiskWrite,ReqCPUS,AllocCPUs,NTasks,Elapsed,State >> usage.txt
        JobStatus="COMPLETED"
        break
    fi
done

However, I'm not really convinced by this solution:

  • sstat unfortunately doesn't show how many CPUs are in use at the moment (only the average)

  • MaxRSS is also not helpful if I try to record memory usage over time

  • there still seems to be an error somewhere (the script doesn't stop after the job finishes)

Does anyone have an idea how to do this properly? Maybe even with top or htop instead of sstat? Any help is much appreciated.

Slurm offers a plugin to record a profile of a job (CPU usage, memory usage, even disk/network I/O for some technologies) into an HDF5 file. The file contains a time series for each measure tracked, and you can choose the time resolution.

You can activate it with:

#SBATCH --profile=<all|none|[energy[,|task[,|filesystem[,|network]]]]>

See the Slurm documentation on the --profile option of sbatch for details.

To check that this plugin is installed, run:

scontrol show config | grep AcctGatherProfileType

It should output AcctGatherProfileType = acct_gather_profile/hdf5.

The files are created in the folder referred to by the ProfileHDF5Dir Slurm configuration parameter (in slurm.conf).
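To make the profiling route more concrete, here is a minimal sketch of how the directives and the post-processing step might fit together. It assumes the HDF5 plugin is enabled on your cluster and that the sh5util tool shipped with Slurm is available to regular users. The 15-second interval, the function name merge_profile and the output file name are illustrative choices, not something prescribed by Slurm.

```shell
#!/bin/bash
#SBATCH --profile=task          # collect per-task CPU and memory samples
#SBATCH --acctg-freq=task=15    # sampling interval in seconds (assumed value)

# Hypothetical helper: merge the per-node HDF5 files of a finished job
# into one file in the current directory. Run it after the job has ended.
merge_profile() {
    local jobid="$1"
    sh5util -j "$jobid" -o "profile_${jobid}.h5"
}

# Inside the batch script the program is started as usual:
#   srun my_program input.in output.out
# Afterwards, for example from a login node (job ID is a placeholder):
#   merge_profile 123456
```

The merged file can then be opened with any HDF5 reader for plotting. sh5util also has an extract mode that writes a time series to CSV; check man sh5util on your system for the options your version supports.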

As for your script, you could try replacing sstat with an SSH connection to the compute nodes to run ps. Assuming pdsh or clush is installed, you could run something like:

pdsh -j $SLURM_JOB_ID ps -u $USER -o pid,state,cputime,%cpu,rssize,command --columns 100 >> usage.txt

This will give you the CPU and memory usage of each process.
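Since the goal is to plot usage over time, each sample also needs a timestamp, which ps does not add by itself. A small filter along the following lines could do that. The function name stamp_lines is made up for this sketch, and the ps column selection in the usage comment is only one possible choice.

```shell
#!/bin/bash
# Hypothetical helper: prefix every line read from stdin with the current
# Unix time. The timestamp is taken once per call, so all lines belonging
# to one sample share the same value.
stamp_lines() {
    local now line
    now=$(date +%s)
    while IFS= read -r line; do
        printf '%s %s\n' "$now" "$line"
    done
}

# Possible usage inside the sampling loop (header-less ps columns):
#   pdsh -j "$SLURM_JOB_ID" ps -u "$USER" -o pid=,pcpu=,rss=,comm= \
#       | stamp_lines >> usage.txt
```

Each line of usage.txt then starts with the sample time, followed by the node name that pdsh prepends and the ps columns, which is straightforward to load into a plotting tool.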

As a final note, your job never terminates because of a circular dependency: the job terminates only when the while loop terminates, and the while loop terminates only when the job terminates. The condition "$JobStatus" == "COMPLETED" will never be observed from within the script. When the job is completed, the script is killed.
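One way to break this circle, sketched here rather than offered as the definitive fix, is to watch the background srun process instead of the job state: the loop samples for as long as that process is alive and then lets the script end normally. The function name sample_while_running is hypothetical; the sstat call in the usage comment is the one from the question.

```shell
#!/bin/bash
# Hypothetical helper: run a sampling command every <interval> seconds
# for as long as the process with the given PID is alive.
#   $1 = PID to watch, $2 = interval in seconds, rest = command to run
sample_while_running() {
    local pid="$1" interval="$2"
    shift 2
    # kill -0 sends no signal; it only tests whether the process exists.
    while kill -0 "$pid" 2>/dev/null; do
        "$@"
        sleep "$interval"
    done
}

# Possible usage in the batch script:
#   srun my_program input.in output.out &
#   job_pid=$!
#   sample_while_running "$job_pid" 15 \
#       sstat --format=AveCPU,AveRSS,MaxRSS -P --noheader -j "$SLURM_JOB_ID" >> usage.txt
#   wait "$job_pid"    # collect the exit status of the program
```

With this structure the script ends shortly after the program does, so the job finishes on its own and the final sacct summary can be requested once the job has ended.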
