
How to share a single GPU deep learning server?

For our development team we want to build a central GPU server for deep learning / training tasks (with one or more powerful GPUs, instead of a separate workstation with its own GPU for each team member). I guess this is a common setup, but I am not sure how to make GPU sharing work for multiple team members simultaneously. We work with TensorFlow/Keras and Python scripts.

My question is: What is the typical approach to let team members train their models on that central server? Just give them SSH access and have them run network training directly from the command line? Or set up a JupyterHub server, so that our developers can run code from their browser?

My main question: If there is only one GPU, how can we make sure that multiple users cannot run their code (i.e. train their networks) at the same time? Is there a way to submit training jobs to some central server software so that they are executed on the GPU one after the other?

(Sorry if this is not the correct site to ask this question, but which other Stack Exchange site would be better?)

Even though we don't need this setup any more, one option to solve this is a workload manager like Slurm, which also supports GPU management (GPUs are scheduled as "generic resources", GRES).
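With Slurm, each user submits a batch script instead of running training directly; the scheduler queues jobs and, if only one GPU exists, runs GPU jobs one after the other. A minimal sketch of such a script (the environment path, script name, and resource numbers are placeholders, not from the original question):

```bash
#!/bin/bash
#SBATCH --job-name=train-model     # name shown in the queue
#SBATCH --gres=gpu:1               # request one GPU; jobs queue until a GPU is free
#SBATCH --cpus-per-task=4          # CPU cores, e.g. for data loading
#SBATCH --mem=16G                  # system RAM for the job
#SBATCH --time=04:00:00            # wall-clock limit
#SBATCH --output=train_%j.log      # stdout/stderr, %j = job id

# Activate the project environment and start training
# (both paths are examples for your own setup).
source ~/venvs/tf/bin/activate
python train.py
```

A user would submit this with `sbatch train_job.sh` and watch the queue with `squeue`; because the node advertises only one GPU, Slurm holds any further `--gres=gpu:1` jobs in the queue until the running one finishes, which gives exactly the one-after-the-other behavior asked about.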
