
How to configure Python in a GPU cluster?

I have a GPU cluster with one storage node and several compute nodes, each with 8 GPUs. I am in the process of configuring the cluster.

One of the tasks is to set up Python. We need several versions of Python along with a number of Python packages, and for some packages we may need several versions side by side, for example different versions of TensorFlow.

So the question is: how should I set up Python and its packages so that it is convenient to switch between the package versions I want to use?

I have installed both Python 2.7 and Python 3.6 on every compute node and on the storage node, but I don't think this approach works well when there is a large number of compute nodes to configure. One alternative is to install Python in a shared directory of the cluster instead of the default /usr/local path. Does anyone have a better way to do this?

I currently use OpenPBS (Torque), and I am new to HPC.

Thanks a lot.

You can install the Modules software environment in a shared directory that is accessible from every node. It is then easy to load a specific version of Python or TensorFlow:

module load lang/Python/3.6.0
module load lib/Tensorflow/1.1.0

If you need several versions of the same package, have a look at Python virtualenv, which lets you install each version in its own isolated environment. To make an environment available on all nodes, create the virtualenv on a shared mount point.
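As a rough sketch of the shared virtualenv idea, the commands below create one environment per package version under a shared directory. It uses the standard-library `venv` module rather than the separate `virtualenv` tool, and the `SHARED_ROOT` variable, the `/tmp/shared-demo` default, and the environment names `tf-old` and `tf-new` are all assumptions made for illustration, not part of the original answer.

```shell
#!/bin/sh
# Hypothetical shared mount point. /tmp/shared-demo is only a stand-in so the
# sketch runs anywhere; point SHARED_ROOT at your cluster's shared directory.
SHARED_ROOT="${SHARED_ROOT:-/tmp/shared-demo}"
ENV_ROOT="$SHARED_ROOT/venvs"
mkdir -p "$ENV_ROOT"

# One environment per package version (the names are illustrative).
for name in tf-old tf-new; do
    # --without-pip keeps the sketch independent of ensurepip;
    # drop the flag in real use so that pip is available inside the environment.
    python3 -m venv --without-pip "$ENV_ROOT/$name"
done

# Activate one of the environments in the current shell.
. "$ENV_ROOT/tf-old/bin/activate"
echo "active environment: $VIRTUAL_ENV"

# Inside the activated environment you would then install the version it is for:
# pip install "tensorflow==<version>"
```

Note that an environment created this way refers to its base interpreter by absolute path, so the Python it was created from should be available at the same path on every node, for instance by installing that Python on the shared directory as well.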

You could install each piece of software once on the storage node under a dedicated directory and mount that directory on the compute nodes. That way you don't have to install each piece of software several times.

A common solution to this problem is Environment Modules. You install your software as a module: the software is installed in its own directory (e.g. /opt/modules/python/3.6/) together with a modulefile. When you run module load python/3.6, the modulefile sets environment variables so that Python 3.6 is found via PATH, PYTHONPATH, etc.

This gives you a clean separation of your software stack and also lets you install newer versions of TensorFlow without disturbing the existing environment.
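To make the effect of loading a module more concrete, here is a small shell sketch of the kind of environment changes a modulefile for python/3.6 might apply. It is not a real modulefile (those are written for the Modules tool itself); `prepend_path` is a hypothetical helper defined here, and the `site-packages` subdirectory is an assumed layout under the example prefix from the answer.

```shell
#!/bin/sh
# Illustrative only: mimics what loading a python/3.6 module might do
# to the current shell's environment.

# Install prefix taken from the example path in the answer.
PREFIX="/opt/modules/python/3.6"

# Hypothetical helper: put a directory first in a colon-separated variable.
#   $1 = variable name, $2 = directory to prepend
prepend_path() {
    eval "_current=\${$1:-}"
    if [ -n "$_current" ]; then
        export "$1=$2:$_current"
    else
        export "$1=$2"
    fi
}

prepend_path PATH "$PREFIX/bin"
# Assumed layout for the package directory under the prefix.
prepend_path PYTHONPATH "$PREFIX/lib/python3.6/site-packages"

echo "PATH now starts with: ${PATH%%:*}"
```

Because each version lives under its own prefix, switching versions only changes which prefix comes first in these variables; the installed files themselves are never touched.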
