
pxssh does not work between compute nodes in a Slurm cluster

I'm using the following script to connect two compute nodes in a Slurm cluster.

from getpass import getuser
from socket import gethostname
from pexpect import pxssh
import sys

# node_list and server_socket are defined elsewhere in the original program.
python = sys.executable
worker_command = "%s -m worker" % python + " %i " + server_socket
pid = 0
children = []
for node, ntasks in node_list.items():
    # Skip the node this script is already running on.
    if node == gethostname():
        continue
    pid_range = range(pid, pid + ntasks)
    pid += ntasks
    # Open one SSH session per remote node and start its workers in the background.
    ssh = pxssh.pxssh()
    ssh.login(node, getuser())
    for worker in pid_range:
        ssh.sendline(worker_command % worker + '&')
    children.append(ssh)

node_list is a dictionary such as {'cn000': 28, 'cn001': 28}, mapping each node name to its number of tasks. worker is a Python file placed in the working directory.

I expected ssh.sendline to behave the same as pexpect.spawn. However, nothing happened after I ran the script.

Although ssh.login(node, getuser()) establishes an SSH session, the line ssh.sendline(worker_command % worker) seems to have no effect: the script that worker_command should launch is never run.
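One way to see whether the remote shell actually received a command is to wait for the prompt after each sendline and read what came back. The sketch below is only an illustration under assumptions: run_and_capture and check_node are hypothetical helper names, key-based SSH login is assumed to work between nodes, and "hostname; pwd" is just a harmless probe command. It relies on pxssh's sendline, prompt, before, and logout.

```python
def run_and_capture(session, command, timeout=30):
    """Send one command and return the text printed before the next prompt.

    `session` is expected to behave like a logged-in pxssh object
    (sendline, prompt, before). Hypothetical helper, not part of pexpect.
    """
    session.sendline(command)
    # prompt() returns False if the shell prompt did not reappear in time.
    if not session.prompt(timeout=timeout):
        raise TimeoutError("no shell prompt after %r" % command)
    output = session.before
    # Without an encoding, pexpect returns bytes on Python 3.
    if isinstance(output, bytes):
        output = output.decode("utf-8", errors="replace")
    return output


def check_node(node, user):
    """Log in to `node` and report which host and directory the shell is in."""
    # Imported here so run_and_capture itself does not require pexpect.
    from pexpect import pxssh

    session = pxssh.pxssh(encoding="utf-8")
    session.login(node, user)  # assumes key-based authentication
    try:
        # The captured text usually starts with the echoed command line.
        return run_and_capture(session, "hostname; pwd")
    finally:
        session.logout()
```

If the directory reported by check_node is not the one containing the worker file, "-m worker" would not find the module on the remote side, which is one possible reason for the workers never starting.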

How can I fix this? Or should I try something else?

How can I create a socket on one compute node and connect it to a socket on another compute node?
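For the socket part, a minimal sketch with the standard library might look like the following. The function names and the "ack:" reply are hypothetical, and the demo runs both ends on 127.0.0.1 so it can be tried on a single machine. On a cluster, the listener would run on one node and the client would connect to that node's hostname (for example 'cn000') on a port that the site's firewall allows.

```python
import socket
import threading


def serve_once(listener):
    """Accept a single connection and acknowledge whatever it sends."""
    conn, _addr = listener.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(b"ack:" + data)


def send_message(host, port, payload):
    """Connect to (host, port), send payload, and return the reply."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
        return sock.recv(1024)


def demo():
    """Run listener and client in one process to show the round trip."""
    # Port 0 lets the OS pick a free port; requires Python 3.8+.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()
    reply = send_message("127.0.0.1", port, b"hello")
    server.join()
    listener.close()
    return reply
```

In a real deployment the listening node would bind to an address reachable from the other nodes rather than 127.0.0.1, and the chosen port would have to be passed to the workers, which is presumably the role of server_socket in the script above.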

A '%s' appears to be missing from the content of worker_command. It contains something like "/usr/bin/python3 -m worker", so worker_command % worker should result in an error.
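This hypothesis can be checked locally without any SSH connection by building the string the same way the question does. In the sketch below the interpreter path and the server_socket value are made-up placeholders. Python applies % before +, so only "%s" is filled in by the first step, and the " %i " that is concatenated afterwards stays available for the later worker_command % worker.

```python
def build_worker_template(python, server_socket):
    """Reproduce the expression from the question.

    `%` binds more tightly than `+`, so this evaluates as
    ("%s -m worker" % python) + " %i " + server_socket.
    """
    return "%s -m worker" % python + " %i " + server_socket


# Both arguments are placeholder values for illustration only.
template = build_worker_template("/usr/bin/python3", "tcp://cn000:5555")
command = template % 3

print(template)
print(command)
```

With these placeholder values the template still contains "%i", so the second formatting step succeeds. It would only fail if server_socket itself contained a '%' character.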

If it does not (which is possible, because this source looks like a short excerpt of the original program), add the string ">>workerprocess.log 2>&1" before the '&', then run your program and look at workerprocess.log on the server. If your $HOME is writable on the server, you should find the error message(s) there.
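The redirection suggested above could be added with a small helper like this one. The name with_log is hypothetical, and the log path defaults to the file name used in this answer. The relative path is resolved by the remote shell, which after an SSH login normally starts in $HOME.

```python
import shlex


def with_log(command, log_path="workerprocess.log"):
    """Append stdout and stderr of `command` to log_path and background it."""
    # shlex.quote protects paths that contain spaces or shell metacharacters.
    return "%s >> %s 2>&1 &" % (command, shlex.quote(log_path))


# Example with a placeholder worker command line.
print(with_log("/usr/bin/python3 -m worker 0"))
```

In the script from the question, the sendline call would then become ssh.sendline(with_log(worker_command % worker)) instead of appending '&' by hand.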

