
Running TensorFlow on a Slurm Cluster?

I have access to a computing cluster, specifically one node with two 12-core CPUs, which runs the Slurm Workload Manager.

I would like to run TensorFlow on that system, but unfortunately I was not able to find any information about how to do this, or whether it is even possible. I am new to this, but as far as I understand it, I would have to run TensorFlow by creating a Slurm job and cannot directly execute python/tensorflow via ssh.

Does anyone have an idea, a tutorial, or any kind of source on this topic?

It's relatively simple.

Under the simplifying assumption that you request one process per host, Slurm will provide all the information you need in environment variables, specifically SLURM_PROCID, SLURM_NPROCS and SLURM_NODELIST.

For example, you can initialize your task index, the number of tasks and the node list as follows:

import os
from hostlist import expand_hostlist

task_index  = int( os.environ['SLURM_PROCID'] )
n_tasks     = int( os.environ['SLURM_NPROCS'] )
tf_hostlist = [ ("%s:22222" % host) for host in
                expand_hostlist( os.environ['SLURM_NODELIST'] ) ]

Note that Slurm gives you the host list in its compressed format (e.g., "myhost[11-99]"), which you need to expand. I do that with the hostlist module by Kent Engström, available here: https://pypi.python.org/pypi/python-hostlist
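If installing python-hostlist is not an option on your cluster, the simple single-range case can be expanded with the standard library alone. The helper below is a hypothetical minimal sketch, not a replacement for the real module, which also handles comma-separated groups, multiple ranges, and nesting:

```python
import re

def expand_hostlist_simple(s):
    """Expand a single-range compressed Slurm hostlist like 'myhost[11-13]'.

    Minimal sketch only: plain hostnames pass through unchanged, and zero
    padding (e.g. 'node[01-03]') is preserved.
    """
    m = re.fullmatch(r'([^\[\]]+)\[(\d+)-(\d+)\]', s)
    if not m:
        return [s]                      # plain hostname, nothing to expand
    prefix, lo, hi = m.group(1), m.group(2), m.group(3)
    width = len(lo)                     # keep leading zeros
    return ["%s%0*d" % (prefix, width, i)
            for i in range(int(lo), int(hi) + 1)]
```

For example, `expand_hostlist_simple("myhost[11-13]")` yields `["myhost11", "myhost12", "myhost13"]`.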

At that point, you can go right ahead and create your TensorFlow cluster specification and server with the information you have available, e.g.:

cluster = tf.train.ClusterSpec( {"your_taskname" : tf_hostlist } )
server  = tf.train.Server( cluster.as_cluster_def(),
                           job_name   = "your_taskname",
                           task_index = task_index )

And you're set! You can now perform TensorFlow node placement on a specific host of your allocation with the usual syntax:

for idx in range(n_tasks):
    with tf.device("/job:your_taskname/task:%d" % idx):
        ...

A flaw in the code above is that all your jobs will instruct TensorFlow to start servers listening on the fixed port 22222. If multiple such jobs happen to be scheduled on the same node, the second one will fail to listen on 22222.
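If reconfiguring Slurm (as described next) is not immediately possible, one stopgap is to derive the port from the job id instead of hard-coding 22222. This is a hypothetical workaround, not part of the original answer: it only makes collisions unlikely, it does not rule them out.

```python
import os

# Stopgap sketch: pick the port from the Slurm job id so that two jobs
# scheduled on the same node usually end up on different ports.
base_port = 20000
job_id = int(os.environ.get('SLURM_JOB_ID', '0'))  # set by Slurm inside a job
port = base_port + (job_id % 5000)                 # stays within 20000-24999
```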

A better solution is to let Slurm reserve ports for each job. You need to bring your Slurm administrator on board and ask them to configure Slurm so that it allows you to request ports with the --resv-ports option. In practice, this means asking them to add a line like the following to their slurm.conf:

MpiParams=ports=15000-19999

Before you bug your Slurm admin, check which options are already configured, e.g., with:

scontrol show config | grep MpiParams

If your site already uses an old version of OpenMPI, there's a chance an option like this is already in place.

Then, amend my first snippet of code as follows:

import os
from hostlist import expand_hostlist

task_index  = int( os.environ['SLURM_PROCID'] )
n_tasks     = int( os.environ['SLURM_NPROCS'] )
port        = int( os.environ['SLURM_STEP_RESV_PORTS'].split('-')[0] )
tf_hostlist = [ ("%s:%d" % (host, port)) for host in
                expand_hostlist( os.environ['SLURM_NODELIST'] ) ]
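Note that SLURM_STEP_RESV_PORTS is only set for job steps launched with the --resv-ports flag, so inside your batch script the processes need to be started through srun. An illustrative launch line (train.py is a hypothetical name for your script) would be:

```shell
# Reserve ports for this step and launch one process per allocated task;
# each process then reads SLURM_STEP_RESV_PORTS as in the snippet above.
srun --resv-ports python train.py
```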

Good luck!

You can simply pass a batch script to Slurm with the sbatch command, like so:

sbatch --partition=part start.sh

Listing available partitions can be done with sinfo.

start.sh (possible configuration):

#!/bin/sh
#SBATCH -N 1          # nodes requested
#SBATCH -n 1          # tasks requested
#SBATCH -c 10         # cores requested
#SBATCH --mem=32000   # memory in MB
#SBATCH -o outfile    # send stdout to outfile
#SBATCH -e errfile    # send stderr to errfile
python run.py

whereas run.py contains the script you want executed by Slurm, i.e. your TensorFlow code.
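run.py itself is not shown in the answer; a minimal hypothetical placeholder that just reports its Slurm task could look like the following, with your TensorFlow code going where the print statement is:

```python
import os

# Placeholder run.py: read the Slurm task layout, falling back to
# single-task defaults so the script also runs outside an allocation.
task_index = int(os.environ.get('SLURM_PROCID', 0))
n_tasks = int(os.environ.get('SLURM_NPROCS', 1))
print("running task %d of %d" % (task_index, n_tasks))
```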

You can look up the details here: https://slurm.schedmd.com/sbatch.html
