
How does one train multiple models in a single script in TensorFlow when there are GPUs present?

Say I have access to a number of GPUs in a single machine (for the sake of argument, assume 8 GPUs, each with 8GB of memory, in one machine with some amount of RAM and disk). I want to run, in one single script on one single machine, a program that evaluates multiple models (say 50 or 200) in TensorFlow, each with a different hyper-parameter setting (say, step size, decay rate, batch size, epochs/iterations, etc.). At the end of training, assume we just record its accuracy and get rid of the model (if you want, assume the model is being checkpointed every so often, so it's fine to just throw away the model and start training from scratch; you may also assume some other data is recorded, like the specific hyper-parameters and the train/validation errors as we train).

Currently I have a (pseudo-)script that looks as follows:

def train_multiple_models_in_one_script_with_gpu(arg):
    '''
    Trains multiple NN models in one session using GPUs correctly.

    arg = some obj/struct with the params for training each of the models.
    '''
    #### try multiple models
    for mdl_id in range(100):
        #### define/create graph
        graph = tf.Graph()
        with graph.as_default():
            ### get mdl
            x = tf.placeholder(float_type, get_x_shape(arg), name='x-input')
            y_ = tf.placeholder(float_type, get_y_shape(arg))
            y = get_mdl(arg,x)
            ### get loss and accuracy
            loss, accuracy = get_accuracy_loss(arg,x,y,y_)
            ### get optimizer variables
            opt = get_optimizer(arg)
            train_step = opt.minimize(loss, global_step=global_step)
        #### run session
        with tf.Session(graph=graph) as sess:
            # train
            for i in range(nb_iterations):
                batch_xs, batch_ys = get_batch_feed(X_train, Y_train, batch_size)
                sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys})
                # check_point mdl
                if i % report_error_freq == 0:
                    sess.run(step.assign(i))
                    #
                    train_error = sess.run(fetches=loss, feed_dict={x: X_train, y_: Y_train})
                    test_error = sess.run(fetches=loss, feed_dict={x: X_test, y_: Y_test})
                    print( 'step %d, train error: %s test_error %s'%(i,train_error,test_error) )

Essentially, it tries lots of models in one single run, but it builds each model in a separate graph and runs each one in a separate session.

I guess my main worry is that it's unclear to me how TensorFlow, under the hood, allocates resources for the GPUs to be used. For example, does it load (part of) the data set only when a session is run? When I create a graph and a model, is it brought into the GPU immediately, or when is it inserted into the GPU? Do I need to clear/free the GPU each time it tries a new model? I don't actually care too much whether the models are run in parallel on multiple GPUs (which would be a nice addition), but I want it to first run everything serially without crashing. Is there anything special I need to do for this to work?
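(For context, the only session options I am aware of that touch this are the ones below; this is a minimal TF 1.x sketch and I am not sure whether these are the right knobs here:)

config = tf.ConfigProto(
    log_device_placement=True,                      # print which device each op is placed on
    gpu_options=tf.GPUOptions(allow_growth=True))   # allocate GPU memory on demand instead of all up front
with tf.Session(graph=graph, config=config) as sess:
    ...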


Currently I am getting an error that starts as follows:

I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                   340000768
InUse:                   336114944
MaxInUse:                339954944
NumAllocs:                      78
MaxAllocSize:            335665152

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ***************************************************xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 160.22MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[60000,700]

and further down the line it says:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[60000,700]
         [[Node: standardNN/NNLayer1/Z1/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](standardNN/NNLayer1/Z1/MatMul, b1/read)]]

I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)

However, further down the output file (where it prints), it seems to print fine the errors/messages that should show as training proceeds. Does this mean that it didn't run out of resources? Or was it actually able to use the GPU? If it was able to use the CPU instead of the GPU, then why is this error happening only when the GPU is about to be used?

The weird thing is that the data set is really not that big (all 60K points total about 24.5MB), and when I run a single model locally on my own computer the process seems to use less than 5GB. The GPUs have at least 8GB, and the machine they are in has plenty of RAM and disk (at least 16GB). Thus, the errors that TensorFlow is throwing at me are quite puzzling. What is it trying to do and why are they occurring? Any ideas?


After reading the answer that suggests using the multiprocessing library, I came up with the following script:

from multiprocessing import Process

def train_mdl(args):
    train(mdl, args)

if __name__ == '__main__':
    for mdl_id in range(100):
        # train one model with some specific hyperparams (assume they are chosen randomly
        # inside the function below, or read from a config file, or they could just be passed, or something)
        p = Process(target=train_mdl, args=(args,))
        p.start()
        p.join()
    print('Done training all models!')

Honestly, I am not sure why his answer suggests using a pool, or why there are the weird tuple parentheses, but this is what makes sense to me. Would the resources for TensorFlow be re-allocated every time a new process is created in the above loop?

I think that running all models in one single script can be bad practice in the long term (see my suggestion below for a better alternative). However, if you would like to do it, here is a solution: you can encapsulate your TF session into a process with the multiprocessing module; this will make sure TF releases the session memory once the process is done. Here is a code snippet:

from multiprocessing import Pool
import contextlib

def my_model((param1, param2, param3)): # Note the extra (), required by the pool syntax (Python 2 tuple unpacking)
    < your code >

num_pool_workers = 1 # can be bigger than 1, to enable parallel execution
with contextlib.closing(Pool(num_pool_workers)) as po: # This ensures that the processes get closed once they are done
    pool_results = po.map_async(my_model,
                                ((param1, param2, param3)
                                 for param1, param2, param3 in params_list))
    results_list = pool_results.get()
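For example, params_list could be a plain list of tuples (the values here are hypothetical, just to show the shape the generator expression above expects):

# Hypothetical hyper-parameter combinations; each tuple matches the
# (param1, param2, param3) signature that my_model unpacks above.
params_list = [(0.1, 64, 100),
               (0.01, 128, 100),
               (0.001, 256, 200)]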

Note from OP: if you use the multiprocessing library, the random number generator seed does not reset automatically for each process. Details here: Using python multiprocessing with different random seed for each process

About TF resource allocation: usually TF allocates much more resources than it needs. Many times you can restrict each process to use a fraction of the total GPU memory, and discover through trial and error the fraction your script requires.

You can do it with the following snippet:

gpu_memory_fraction = 0.3 # Choose this number through trial and error
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction,)
session_config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=session_config, graph=graph)

Note that sometimes TF increases the memory usage in order to accelerate the execution. Therefore, reducing the memory usage might make your model run slower.

Answers to the new questions in your edit/comments:

  1. Yes, TensorFlow's resources will be re-allocated every time a new process is created, and cleared once a process ends.

  2. The for-loop in your edit should also do the job. I suggest using Pool instead, because it will enable you to run several models concurrently on a single GPU. See my notes about setting gpu_memory_fraction and "choosing the maximal number of processes". Also note that: (1) the Pool map runs the loop for you, so you don't need an outer for-loop once you use it; (2) in your example, you should have something like mdl=get_model(args) before calling train() (see the sketch after this list).

  3. Weird tuple parentheses: Pool only accepts a single argument, therefore we use a tuple to pass multiple arguments. See multiprocessing.pool.map and function with two arguments for more details. As suggested in one answer, you can make it more readable with:

     def train_mdl(params):
         (x, y) = params
         < your code >
  4. As @Seven suggested, you can use the CUDA_VISIBLE_DEVICES environment variable to choose which GPU to use for your process. You can do it from within your python script, using the following at the beginning of the process function (train_mdl); see the sketch after this list.

     import os  # the import can be at the top of the python script
     os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(gpu_id)
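Putting points 2-4 together, here is a minimal sketch of what train_mdl might look like (get_model and train are the placeholder helpers from the question, and gpu_id is assumed to arrive as part of the params tuple):

import os

def train_mdl(params):
    # Pool passes a single argument, so unpack the tuple ourselves (point 3).
    args, gpu_id = params
    # Restrict this process to one GPU (point 4); set before TF touches the GPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(gpu_id)
    # Build the model before training it (point 2); these helpers are placeholders.
    mdl = get_model(args)
    train(mdl, args)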

A better practice for executing your experiments would be to isolate your training/evaluation code from the hyper-parameter / model-search code. E.g. have a script named train.py, which accepts a specific combination of hyper-parameters and references to your data as arguments, and executes training for a single model.

Then, to iterate through all the possible combinations of parameters, you can use a simple task (jobs) queue, and submit all the possible combinations of hyper-parameters as separate jobs. The task queue will feed your jobs one at a time to your machine. Usually, you can also set the queue to execute a number of processes concurrently (see details below).

Specifically, I use task spooler, which is super easy to install and handy (it doesn't require admin privileges; details below).

Basic usage is (see notes below about task spooler usage):

ts <your-command>

In practice, I have a separate python script that manages my experiments, sets all the arguments per specific experiment, and sends the jobs to the ts queue.

Here are some relevant snippets of python code from my experiments manager:

run_bash executes a bash command:

import subprocess

def run_bash(cmd):
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, executable='/bin/bash')
    out = p.stdout.read().strip()
    return out  # This is the stdout from the shell command
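For example (my own usage note, not from the original answer; I believe ts with no arguments lists the queued jobs):

print(run_bash('ts'))  # show the jobs currently in the queue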

The next snippet sets the number of concurrent processes to be run (see note below about choosing the maximal number of processes):

max_job_num_per_gpu = 2
run_bash('ts -S %d'%max_job_num_per_gpu)

The next snippet iterates through a list of all combinations of hyper-params / model params. Each element of the list is a dictionary, where the keys are the command line arguments for the train.py script:

for combination_dict in combinations_list:

    job_cmd = 'python train.py ' + '  '.join(
            ['--{}={}'.format(flag, value) for flag, value in combination_dict.iteritems()])  # use .items() on Python 3

    submit_cmd = "ts bash -c '%s'" % job_cmd
    run_bash(submit_cmd)
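combinations_list itself can be built with itertools.product, for example (the parameter names and values here are hypothetical):

import itertools

# Hypothetical grid of hyper-parameters; each dict becomes one train.py invocation.
param_grid = {'lr': [0.1, 0.01], 'batch_size': [32, 64], 'decay': [0.9, 0.99]}
keys = sorted(param_grid.keys())
combinations_list = [dict(zip(keys, values))
                     for values in itertools.product(*(param_grid[k] for k in keys))]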

A note about choosing the maximal number of processes:

If you are short on GPUs, you can use the gpu_memory_fraction you found to set the number of processes as max_job_num_per_gpu=int(1/gpu_memory_fraction).

Notes about task spooler (ts):

  1. You could set the number of concurrent processes to run ("slots") with:

    ts -S <number-of-slots>

  2. Installing ts doesn't require admin privileges. You can download and compile it from source with a simple make, add it to your path, and you're done.

  3. You can set up multiple queues (I use this for multiple GPUs), with:

    TS_SOCKET=<path_to_queue_name> ts <your-command>

    e.g.

    TS_SOCKET=/tmp/socket-ts.gpu_queue_1 ts <your-command>

    TS_SOCKET=/tmp/socket-ts.gpu_queue_2 ts <your-command>

  4. See here for further usage examples.

A note about automatically setting the path names and file names: once you separate your main code from the experiment manager, you will need an efficient way to generate file names and directory names, given the hyper-params. I usually keep my important hyper-params in a dictionary and use the following function to generate a single chained string from the dictionary key-value pairs. Here are the functions I use for doing it:

import re


def build_string_from_dict(d, sep='%'):
    """
    Builds a string from a dictionary.
    Mainly used for formatting hyper-params to file names.
    Key-value pairs are sorted by the key name.

    :param d: input dictionary
    :param sep: key-value separator

    Returns: string
    """

    return sep.join(['{}={}'.format(k, _value2str(d[k])) for k in sorted(d.keys())])


def _value2str(val):
    if isinstance(val, float):
        # %g means: "Floating point format.
        # Uses lowercase exponential format if exponent is less than -4 or not less than precision,
        # decimal format otherwise."
        val = '%g' % val
    else:
        val = '{}'.format(val)
    val = re.sub(r'\.', '_', val)
    return val
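For example (a usage sketch; the output follows from the sorting and %g formatting above):

hp = {'lr': 0.001, 'batch_size': 64, 'decay': 0.9}
print(build_string_from_dict(hp))
# batch_size=64%decay=0_9%lr=0_001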

As I understand it, TensorFlow first constructs a symbolic graph and infers the derivatives based on the chain rule. It then allocates memory for all (necessary) tensors, including some inputs and outputs of layers, for efficiency. When running a session, data will be loaded into the graph, but in general memory use will not change any more.

The error you met, I guess, may be caused by constructing several models on one GPU.

Isolating your training/evaluation code from the hyper-parameters is a good choice, as @user2476373 proposed. But I am using bash scripts directly, not task spooler (maybe it's more convenient), e.g.

CUDA_VISIBLE_DEVICES=0 python train.py --lrn_rate 0.01 --weight_decay_rate 0.001 --momentum 0.9 --batch_size 8 --max_iter 60000 --snapshot 5000
CUDA_VISIBLE_DEVICES=0 python eval.py 

Or you can write a 'for' loop in the bash script, not necessarily in a python script. Note that I used CUDA_VISIBLE_DEVICES=0 at the beginning of the script (the index could be up to 7 if you have 8 GPUs in one machine). Based on my experience, I've found that TensorFlow uses all GPUs in one machine if I don't specify which GPU an operation should use, with code like this:

with tf.device('/gpu:0'):

If you want to try a multi-GPU implementation, there are some examples.
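As a minimal, self-contained TF 1.x sketch of explicit placement (my own illustration, not the linked example), each model's graph can be pinned to a specific GPU at construction time:

import tensorflow as tf

mdl_id = 3            # hypothetical model index
graph = tf.Graph()
with graph.as_default():
    with tf.device('/gpu:%d' % (mdl_id % 8)):   # round-robin over 8 GPUs
        a = tf.random_normal([1000, 1000])
        b = tf.matmul(a, a)

# allow_soft_placement lets TF fall back to CPU if the requested GPU is unavailable;
# log_device_placement prints where each op actually runs.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(graph=graph, config=config) as sess:
    sess.run(b)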

Hope this could help you.

You probably don't want to do this. 你可能不想这样做。

If you run thousands and thousands of models on your data, and pick the one that evaluates best, you are not doing machine learning; instead you are memorizing your data set, and there is no guarantee that the model you pick will perform at all outside that data set.

In other words, that approach is similar to having a single model which has thousands of degrees of freedom. Having a model with such a high order of complexity is problematic, since it will be able to fit your data better than is actually warranted; such a model is annoyingly able to memorize any noise (outliers, measurement errors, and such) in your training data, which causes the model to perform poorly when the noise is even slightly different.

(Apologies for posting this as an answer; the site wouldn't let me add a comment.)

An easy solution: give each model a unique session and graph.

It works on this platform: TensorFlow 1.12.0, Keras 2.1.6-tf, Python 3.6.7, Jupyter Notebook.

Key code:

with session.as_default():
    with session.graph.as_default():
        # do something about an ANN model

Full code:

import tensorflow as tf
from tensorflow import keras
import gc

def limit_memory():
    """ Release unused memory resources. Force garbage collection """
    keras.backend.clear_session()
    keras.backend.get_session().close()
    tf.reset_default_graph()
    gc.collect()
    #cfg = tf.ConfigProto()
    #cfg.gpu_options.allow_growth = True
    #keras.backend.set_session(tf.Session(config=cfg))
    keras.backend.set_session(tf.Session())
    gc.collect()


def create_and_train_ANN_model(hyper_parameter):
    print('create and train my ANN model')
    info = { 'result about this ANN model' }
    return info

for i in range(10):
    limit_memory()        
    session = tf.Session()
    keras.backend.set_session(session)
    with session.as_default():
        with session.graph.as_default():   
            hyper_parameter = { 'A set of hyper-parameters' }  
            info = create_and_train_ANN_model(hyper_parameter)      
    limit_memory()

Inspired by this link: Keras (Tensorflow backend) Error - Tensor input_1:0, specified in either feed_devices or fetch_devices was not found in the Graph

I have the same issue. My solution is to run, from another script, the following as many times and in as many hyper-parameter configurations as you want:

cmd = "python3 ./model_train.py hyperparameters"
os.system(cmd)
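For example, the outer loop could look like this (flag names and values are hypothetical; model_train.py is the script mentioned above):

import os

# Hypothetical sweep: each iteration launches model_train.py as a fresh process,
# so TensorFlow's GPU memory is fully released between runs.
for lr in [0.1, 0.01, 0.001]:
    for batch_size in [32, 64]:
        cmd = "python3 ./model_train.py --lr {} --batch_size {}".format(lr, batch_size)
        os.system(cmd)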
