Train multiple keras/tensorflow models on different GPUs simultaneously
I would like to train multiple models on multiple GPUs simultaneously from within a Jupyter notebook. I am working on a node with 4 GPUs. I would like to assign one GPU to each model and train 4 different models at the same time. Right now, I select a GPU for one notebook like this:
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '1'  # this notebook sees only GPU 1

def model(...):
    ...

model.fit(...)
in four different notebooks. However, the results and output of the fitting procedure are then spread across four different notebooks, while running the models sequentially in a single notebook takes a lot of time. How do you assign GPUs to individual functions and run them in parallel?
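One possible approach (a sketch, not from the answers below): launch one worker process per GPU from a single notebook, and restrict each process to its GPU via `CUDA_VISIBLE_DEVICES`. The names `build_model`, `x`, and `y` are hypothetical placeholders for your own model factory and training data.

```python
import multiprocessing as mp
import os

def train_on_gpu(gpu_id, epochs=10):
    """Worker: pin this process to one GPU, then build and fit a model."""
    # Must be set before TensorFlow is imported, hence the late import below.
    os.environ['CUDA_VISIBLE_DEVICES'] = str(gpu_id)
    import tensorflow as tf  # imported only after the env var is set

    model = build_model()            # hypothetical factory for your model
    model.fit(x, y, epochs=epochs)   # hypothetical training data
    model.save(f'model_gpu{gpu_id}.h5')

def launch_all(num_gpus=4):
    """Start one training process per GPU and wait for all of them."""
    # 'spawn' gives each worker a fresh interpreter; 'fork' can inherit
    # CUDA state from the parent process and break GPU initialisation.
    ctx = mp.get_context('spawn')
    workers = [ctx.Process(target=train_on_gpu, args=(i,))
               for i in range(num_gpus)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

Calling `launch_all(4)` runs the four fits concurrently; their log output interleaves in the one notebook, so saving each model (or its history) to a per-GPU file is the easiest way to keep the results separated.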
I recommend using TensorFlow device scopes, like so:
with tf.device('/gpu:0'):
    model1.fit()
with tf.device('/gpu:1'):
    model2.fit()
with tf.device('/gpu:2'):
    model3.fit()
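Note that written back-to-back like this, the three `fit()` calls still execute one after another, since each call blocks until training finishes. To actually overlap them in one notebook, each fit can run in its own thread; `fit_in_thread` below is a hypothetical helper sketching that idea.

```python
import threading

def fit_in_thread(device, model, *args, **kwargs):
    """Hypothetical helper: run model.fit pinned to `device` in a thread."""
    import tensorflow as tf  # late import keeps the helper self-contained

    def _run():
        with tf.device(device):   # place this model's ops on one GPU
            model.fit(*args, **kwargs)

    t = threading.Thread(target=_run)
    t.start()
    return t

# Usage sketch, assuming models and training data (X, Y) already exist:
#   threads = [fit_in_thread(f'/gpu:{i}', m, X, Y)
#              for i, m in enumerate([model1, model2, model3])]
#   for t in threads:
#       t.join()
```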
If you want to train models on different cloud GPUs (e.g. GPU instances from AWS), try this library:
!pip install aibro==0.0.45 --extra-index-url https://test.pypi.org/simple

from aibro.train import fit

machine_id = 'g4dn.4xlarge'  # instance name on AWS
job_id, trained_model, history = fit(
    model=model,
    train_X=train_X,
    train_Y=train_Y,
    validation_data=(validation_X, validation_Y),
    machine_id=machine_id,
)
Tutorial: https://colab.research.google.com/drive/19sXZ4kbic681zqEsrl_CZfB5cegUwuIB#scrollTo=ERqoHEaamR1Y