
Running tensorboard and python script at the same time

I want to submit an sbatch script. The main part is training a deep learning model, but I also want to run tensorboard at the same time for logging.

Now I have my script.slurm:

#!/bin/bash -l
#SBATCH -N 1
#SBATCH --ntasks-per-node=6
#SBATCH --mem-per-cpu=5GB
#SBATCH --gres=gpu:1

tensorboard --logdir:runs
python3 trainloop.py

It launches tensorboard, and runs the script only after I close the tensorboard server. I changed it to

srun tensorboard --logdir:runs &
srun python3 trainloop.py

but now, for some reason, it loops trying to launch tensorboard multiple times and gives this error:

E1114 21:45:51.826188 47451355829184 program.py:298] TensorBoard could not bind to port 8872, it was already in use

What is the best approach to have a tensorboard server running alongside my script?

Adding the ampersand ( & ) is the right solution, but you should not be using srun. srun starts as many tasks as were requested, i.e. as many instances of tensorboard --logdir:runs as there are tasks requested with --ntasks-per-node=6, and those instances all try to bind the same port, which produces the "already in use" error. The same goes for the second srun: it will start 6 instances of python3 trainloop.py unless that script uses MPI behind the scenes.

So this

tensorboard --logdir=runs &
python3 trainloop.py

should do what you want. Note that the log directory is passed as --logdir=runs (or --logdir runs), not --logdir:runs.
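If you would rather stop the tensorboard process explicitly once training ends, instead of leaving the cleanup to the scheduler, the same idea can be wrapped in a small shell function. This is only a sketch: run_with_sidecar is a hypothetical helper name (it is not part of Slurm or TensorBoard), and it assumes both commands can be passed as plain command-line strings.

```shell
#!/bin/bash
# Hypothetical helper: run a long-lived side process in the background
# while a main command runs in the foreground, then stop the side process.
run_with_sidecar() {
    local sidecar_cmd="$1"   # background command line, e.g. the tensorboard call
    local main_cmd="$2"      # foreground command line, e.g. the training script

    # Start the side process in the background; exec makes $! the PID
    # of the command itself rather than of a wrapper shell.
    bash -c "exec $sidecar_cmd" &
    local sidecar_pid=$!

    # Run the main command in the foreground and remember its exit status.
    local status=0
    bash -c "$main_cmd" || status=$?

    # Stop the side process (SIGTERM) and reap it; ignore errors if it
    # has already exited on its own.
    kill "$sidecar_pid" 2>/dev/null || true
    wait "$sidecar_pid" 2>/dev/null || true

    # Report the main command's exit status to the caller.
    return "$status"
}
```

In script.slurm, after the #SBATCH lines, this would be called as run_with_sidecar "tensorboard --logdir=runs" "python3 trainloop.py", with no srun in front of either command. Returning the training script's exit status means a failed training run still ends the batch script with a non-zero exit code.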


 