How does one train multiple models in a single script in TensorFlow when there are GPUs present?

Question

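Say I have access to a number of GPUs in a single machine (for the sake of argument, assume 8 GPUs, each with a maximum memory of 8GB, in one machine with some amount of RAM and disk). I want to run, in a single script and on a single machine, a program that evaluates multiple models in TensorFlow (say 50 or 200), each with a different hyperparameter setting (say, step size, decay rate, batch size, epochs/iterations, etc.). At the end of training, assume we just record its accuracy and get rid of the model (if you want, assume the model is being checkpointed every so often, so it's fine to throw away the model and start training from scratch. You may also assume some other data is recorded, like the specific hyperparameters, training and validation errors, etc.).

Currently I have a (pseudo-)script that looks as follows: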

def train_multiple_modles_in_one_script_with_gpu(arg):
    '''
    trains multiple NN models in one session using GPUs correctly.

    arg = some obj/struct with the params for training each of the models.
    '''
    #### try multiple models
    for mdl_id in range(100):
        #### define/create graph
        graph = tf.Graph()
        with graph.as_default():
            ### get mdl
            x = tf.placeholder(float_type, get_x_shape(arg), name='x-input')
            y_ = tf.placeholder(float_type, get_y_shape(arg))
            y = get_mdl(arg,x)
            ### get loss and accuracy
            loss, accuracy = get_accuracy_loss(arg,x,y,y_)
            ### get optimizer variables
            opt = get_optimizer(arg)
            train_step = opt.minimize(loss, global_step=global_step)
        #### run session
        with tf.Session(graph=graph) as sess:
            # train
            for i in range(nb_iterations):
                batch_xs, batch_ys = get_batch_feed(X_train, Y_train, batch_size)
                sess.run(fetches=train_step, feed_dict={x: batch_xs, y_: batch_ys})
                # check_point mdl
                if i % report_error_freq == 0:
                    sess.run(step.assign(i))
                    #
                    train_error = sess.run(fetches=loss, feed_dict={x: X_train, y_: Y_train})
                    test_error = sess.run(fetches=loss, feed_dict={x: X_test, y_: Y_test})
                    print( 'step %d, train error: %s test_error %s'%(i,train_error,test_error) )

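Essentially it tries lots of models in one single run, but it builds each model in a separate graph and runs each one in a separate session.

I guess my main worry is that it's unclear to me how TensorFlow allocates resources for the GPUs to be used. For example, does it load (part of) the data set only when a session is run? When I create a graph and a model, is it brought to the GPU immediately, or when is it placed on the GPU? Do I need to clear/free the GPU each time it tries a new model? I don't actually care too much whether the models run in parallel on multiple GPUs (which would be a nice addition), but I want it to first run everything serially without crashing. Is there anything special I need to do for this to work?

Currently I am getting an error that starts as follows: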

I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats:
Limit:                   340000768
InUse:                   336114944
MaxInUse:                339954944
NumAllocs:                      78
MaxAllocSize:            335665152

W tensorflow/core/common_runtime/bfc_allocator.cc:274] ***************************************************xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
W tensorflow/core/common_runtime/bfc_allocator.cc:275] Ran out of memory trying to allocate 160.22MiB.  See logs for memory state.
W tensorflow/core/framework/op_kernel.cc:975] Resource exhausted: OOM when allocating tensor with shape[60000,700]

and further down it says:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[60000,700]
         [[Node: standardNN/NNLayer1/Z1/add = Add[T=DT_FLOAT, _device="/job:localhost/replica:0/task:0/gpu:0"](standardNN/NNLayer1/Z1/MatMul, b1/read)]]

I tensorflow/core/common_runtime/gpu/gpu_device.cc:975] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla P100-SXM2-16GB, pci bus id: 0000:06:00.0)

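However, further down in the output file (where it prints), it seems to print fine the errors/messages that should show up as training proceeds. Does this mean that it didn't run out of resources? Or was it actually able to use the GPU? If it was able to use the CPU instead of the GPU, why does this error only appear when the GPU is about to be used?

The weird thing is that the data set is really not that big (all 60K points are 24.5M), and when I run a single model locally on my own computer the process seems to use less than 5GB. The GPUs have at least 8GB, and the computer hosting them has plenty of RAM and disk (at least 16GB). Thus, the errors TensorFlow is throwing at me are quite puzzling. What is it trying to do, and why are they occurring? Any ideas?

After reading the answer that suggests using the multiprocessing library, I came up with the following script: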

def train_mdl(args):
    train(mdl,args)

if __name__ == '__main__':
    for mdl_id in range(100):
        # train one model with some specific hyperparams (assume they are chosen randomly inside the function below, read from a config file, or simply passed in)
        p = Process(target=train_mdl, args=(args,))
        p.start()
        p.join()
    print('Done training all models!')

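Honestly, I'm not sure why his answer suggests using a pool, or why there are weird tuple parentheses, but this is what makes sense to me. Would the resources for TensorFlow be re-allocated every time a new process is created in the loop above?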

Answer 1

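I think that running all models in one single script can be bad practice in the long term (see my suggestion below for a better alternative). However, if you would like to do it, here is a solution: you can encapsulate your TF session in a process with the multiprocessing module, which makes sure TF releases the session memory once the process is done. Here is a code snippet: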

from multiprocessing import Pool
import contextlib

def my_model(params):
    # Pool.map passes one argument per call, so the parameters arrive packed in a tuple.
    # (Python 2 could unpack in the signature, def my_model((param1, param2, param3)),
    # but that syntax was removed in Python 3.)
    param1, param2, param3 = params
    pass  # <your code>

num_pool_workers = 1  # can be bigger than 1, to enable parallel execution
with contextlib.closing(Pool(num_pool_workers)) as po:  # This ensures that the processes get closed once they are done
    pool_results = po.map_async(my_model,
                                ((param1, param2, param3)
                                 for param1, param2, param3 in params_list))
    results_list = pool_results.get()

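Note from OP: the random number generator seed does not get reset automatically by the multiprocessing library if you choose to use it. Details: "Using python multiprocessing with different random seed for each process".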
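One way to deal with the seeding issue from the note is to derive a distinct seed for every job in the parent process and apply it at the top of the worker function. This is only a minimal sketch, not the answer's code: `make_jobs`, `train_one`, and `run_all` are hypothetical names, the "training" is a stand-in that draws one random number, and only Python's `random` module is seeded here (NumPy or TensorFlow would need their own seeding calls).

```python
import random
from multiprocessing import Pool


def make_jobs(params_list, base_seed=0):
    """Pair every hyperparameter setting with its own seed (base_seed + index)."""
    return [(base_seed + i, params) for i, params in enumerate(params_list)]


def train_one(job):
    """Hypothetical worker: seed first, then 'train' (here: draw one number)."""
    seed, params = job
    random.seed(seed)  # explicit per-job seed, independent of inherited state
    fake_accuracy = random.random()  # placeholder for a real training run
    return params, seed, fake_accuracy


def run_all(params_list, workers=1):
    """Run every job in a worker process and collect the results in order."""
    # maxtasksperchild=1 makes the pool start a fresh process for each job.
    with Pool(workers, maxtasksperchild=1) as pool:
        return pool.map(train_one, make_jobs(params_list))
```

Calling `run_all([{'lr': 0.1}, {'lr': 0.01}])` under an `if __name__ == '__main__':` guard returns one tuple per setting. Because each job carries its own seed, a rerun reproduces the same numbers regardless of which worker picks up which job.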

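About TF resource allocation: usually TF allocates much more resources than it needs. Many times you can restrict each process to a fraction of the total GPU memory, and discover through trial and error the fraction your script requires.

You can do it with the following snippet: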

gpu_memory_fraction = 0.3 # Choose this number through trial and error
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction,)
session_config = tf.ConfigProto(gpu_options=gpu_options)
sess = tf.Session(config=session_config, graph=graph)

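Note that sometimes TF increases memory usage in order to accelerate execution. Therefore, reducing memory usage might make your model run slower.

Answers to the new questions in your edit/comments: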

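Yes, TensorFlow resources are re-allocated every time a new process is created, and cleared once the process ends.
The for-loop in your edit should also do the job. I suggest using Pool instead, because it lets you run several models concurrently on a single GPU. See my notes about setting gpu_memory_fraction and "choosing the maximal number of processes". Also note that: (1) Pool map runs the loop for you, so you don't need an outer for-loop once you use it. (2) In your example, you should have something like mdl=get_model(args) before calling train().
Weird tuple parentheses: Pool only accepts a single argument, so we use a tuple to pass multiple arguments. For more details, see "multiprocessing.pool.map and function with two arguments". As suggested in one of the answers there, you can make it more readable with: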
```
def train_mdl(params):
    (x, y) = params
    # <your code>
```
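As @Seven suggested, you can use the CUDA_VISIBLE_DEVICES environment variable to choose which GPU your process uses. You can do it from within your Python script by putting the following at the beginning of the process function (train_mdl):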
```
import os  # the import can be at the top of the Python script
os.environ["CUDA_VISIBLE_DEVICES"] = "{}".format(gpu_id)
```
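When several jobs run at once, each of them needs a GPU index from somewhere. A minimal sketch of one possible policy (round-robin over a fixed number of GPUs) follows. `pick_gpu`, `pin_to_gpu`, and `NUM_GPUS` are hypothetical names, and the value 8 is taken from the machine described in the question. The variable has to be set before CUDA is initialized in that process, which is why it goes at the start of the process function.

```python
import os

NUM_GPUS = 8  # assumption: the 8-GPU machine from the question


def pick_gpu(job_index, num_gpus=NUM_GPUS):
    """Round-robin policy: job 0 -> GPU 0, job 1 -> GPU 1, ..., job 8 -> GPU 0."""
    return job_index % num_gpus


def pin_to_gpu(gpu_id):
    """Restrict the current process to one GPU; call before TensorFlow starts."""
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)


def train_mdl(job):
    """Hypothetical process function: pin first, then do the real work."""
    job_index, params = job
    pin_to_gpu(pick_gpu(job_index))
    # import tensorflow as tf  # deferred so the setting above takes effect
    # ... build and train the model with `params` ...
    return os.environ["CUDA_VISIBLE_DEVICES"]
```

Inside the pinned process the selected card is the only visible one, so TensorFlow addresses it as `/gpu:0` whatever its physical index.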

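A better practice for executing your experiments is to isolate your training/evaluation code from the hyperparameter/model search code. E.g. have a script named train.py, which accepts a specific combination of hyperparameters and references to your data as arguments, and executes training for a single model.

Then, to iterate through all the possible combinations of parameters, you can use a simple task (job) queue and submit all the possible combinations of hyperparameters as separate jobs. The task queue will feed your jobs to your machine one at a time. Usually, you can also set the queue to execute a number of processes concurrently (see details below).

Specifically, I use task spooler, which is very easy to install and handy (it doesn't require admin privileges; details below).

Basic usage is (see the notes below about task spooler usage):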

ts <your-command>

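In practice, I have a separate Python script that manages my experiments, sets all the arguments for each specific experiment, and sends the jobs to the ts queue.

Here are some relevant snippets of Python code from my experiment manager:

run_bash executes a bash command: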

import subprocess

def run_bash(cmd):
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, executable='/bin/bash')
    out = p.stdout.read().strip()
    return out  # This is the stdout from the shell command

The next snippet sets the number of concurrent processes to run (see the note below about choosing the maximal number of processes):

max_job_num_per_gpu = 2
run_bash('ts -S %d'%max_job_num_per_gpu)

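The next snippet iterates through a list of all combinations of hyperparameters/model parameters. Each element of the list is a dictionary whose keys are the command-line arguments of the train.py script: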

for combination_dict in combinations_list:

    job_cmd = 'python train.py ' + '  '.join(
            ['--{}={}'.format(flag, value) for flag, value in combination_dict.items()])

    submit_cmd = "ts bash -c '%s'" % job_cmd
    run_bash(submit_cmd)
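The snippet above takes `combinations_list` as given. One way it might be built, assuming a plain grid search, is to expand a dictionary of candidate values with `itertools.product`. The helper name `build_combinations` and the grid are hypothetical; the flag names are borrowed from the train.py command line shown in Answer 2.

```python
import itertools


def build_combinations(grid):
    """Expand {flag: [values, ...]} into a list of {flag: value} dicts."""
    flags = sorted(grid)  # fixed order keeps the output reproducible
    value_lists = [grid[flag] for flag in flags]
    return [dict(zip(flags, values)) for values in itertools.product(*value_lists)]


# Hypothetical search grid; replace the flags with whatever train.py accepts.
grid = {"lrn_rate": [0.1, 0.01], "batch_size": [32, 64, 128]}
combinations_list = build_combinations(grid)
print(len(combinations_list))  # 2 * 3 = 6 combinations
```

The number of jobs grows multiplicatively with every added flag, so it is worth checking `len(combinations_list)` before submitting everything to the queue.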

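A note about choosing the maximal number of processes:

If you are short on GPUs, you can use the gpu_memory_fraction you found to set the number of processes as max_job_num_per_gpu=int(1/gpu_memory_fraction).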
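As a worked instance of that formula, with the 0.3 fraction used in the earlier snippet: the helper name `max_jobs_per_gpu` is hypothetical, and the result is only an upper bound, since it assumes every process really stays within its fraction.

```python
def max_jobs_per_gpu(gpu_memory_fraction):
    """Number of processes that fit on one GPU if each takes the given fraction."""
    if not 0 < gpu_memory_fraction <= 1:
        raise ValueError("gpu_memory_fraction must be in (0, 1]")
    return int(1 / gpu_memory_fraction)


print(max_jobs_per_gpu(0.3))  # 1 / 0.3 = 3.33..., truncated to 3
print(max_jobs_per_gpu(0.5))  # 2
```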

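Notes about task spooler (ts):

You can set the number of concurrent processes to run ("slots") with:
ts -S <number-of-slots>
Installing ts doesn't require admin privileges. You can download and compile it from source with a simple make, add it to your path, and you're done.
You can set up multiple queues (I use this for multiple GPUs) with:
TS_SOCKET=<path_to_queue_name> ts <your-command>

For example:
TS_SOCKET=/tmp/socket-ts.gpu_queue_1 ts <your-command>

TS_SOCKET=/tmp/socket-ts.gpu_queue_2 ts <your-command>
For further usage examples, see the task spooler documentation.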
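The per-GPU queues can be combined with the experiment manager by generating the submit line in Python. The sketch below only builds the string and does not run anything. `build_ts_submit` is a hypothetical helper, the socket prefix is copied from the example paths above, and pairing queue N with `CUDA_VISIBLE_DEVICES=N` is an assumption about how the queues map to GPUs.

```python
import shlex


def build_ts_submit(job_cmd, gpu_id, socket_prefix="/tmp/socket-ts.gpu_queue_"):
    """Return a shell line that submits job_cmd to the ts queue of one GPU."""
    socket_path = "{}{}".format(socket_prefix, gpu_id)
    pinned_cmd = "CUDA_VISIBLE_DEVICES={} {}".format(gpu_id, job_cmd)
    # shlex.quote protects the command from being split by the outer shell.
    return "TS_SOCKET={} ts bash -c {}".format(
        shlex.quote(socket_path), shlex.quote(pinned_cmd)
    )


print(build_ts_submit("python train.py --lrn_rate=0.01", gpu_id=1))
```

The returned string can be passed to a function like run_bash in place of the hand-formatted submit_cmd.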

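A note about automatically setting path names and file names: once you separate your main code from the experiment manager, you will need an efficient way to generate file and directory names from the hyperparameters. I usually keep my important hyperparameters in a dictionary and use the following function to generate a single chained string from the dictionary's key-value pairs. Here are the functions I use for doing it: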

import re

def build_string_from_dict(d, sep='%'):
    """
    Builds a string from a dictionary.
    Mainly used for formatting hyper-params to file names.
    Key-value pairs are sorted by the key name.

    :param d: input dictionary
    :param sep: separator placed between key=value pairs
    :return: string
    """
    return sep.join(['{}={}'.format(k, _value2str(d[k])) for k in sorted(d.keys())])


def _value2str(val):
    if isinstance(val, float):
        # %g means: "Floating point format.
        # Uses lowercase exponential format if exponent is less than -4 or not less than precision,
        # decimal format otherwise."
        val = '%g' % val
    else:
        val = '{}'.format(val)
    # Dots would be confusing in file names, so replace them with underscores.
    val = re.sub(r'\.', '_', val)
    return val

Answer 2

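As far as I know, TensorFlow first constructs a symbolic graph and infers the derivatives based on the chain rule. It then allocates memory for all (necessary) tensors, including some layers' inputs and outputs for efficiency. When a session is run, data is loaded into the graph, but in general memory use does not change any more.

My guess is that the error you hit may be caused by constructing several models on one GPU.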
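The numbers in the question's log can be checked by hand. The sketch below assumes 4 bytes per element (the failing node is reported as DT_FLOAT) and only does arithmetic on values copied from the log. It does not explain why the allocator's limit was so low.

```python
BYTES_PER_FLOAT32 = 4  # the failing node is DT_FLOAT
MIB = 1024 ** 2

tensor_bytes = 60000 * 700 * BYTES_PER_FLOAT32  # shape[60000,700] from the log
limit_bytes = 340000768   # "Limit" in the allocator stats
in_use_bytes = 336114944  # "InUse" in the allocator stats

tensor_mib = tensor_bytes / MIB
limit_mib = limit_bytes / MIB
free_mib = (limit_bytes - in_use_bytes) / MIB

print("tensor: %.2f MiB" % tensor_mib)  # matches the 160.22MiB request in the log
print("limit:  %.2f MiB" % limit_mib)
print("free:   %.2f MiB" % free_mib)
```

The 60000 rows appear to be the full training set, which the question's script feeds in one call when it reports the training error. A single layer output at that size needs roughly half of what the allocator was allowed to use in total. That limit is far below the capacity of the card, which would be consistent with most of the GPU memory already being held elsewhere, though the log alone does not say by what.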

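As @user2476373 proposed, isolating your training/evaluation code from the hyperparameters is a good choice. But I use a bash script directly rather than task spooler (which may be more convenient), e.g.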

CUDA_VISIBLE_DEVICES=0 python train.py --lrn_rate 0.01 --weight_decay_rate 0.001 --momentum 0.9 --batch_size 8 --max_iter 60000 --snapshot 5000
CUDA_VISIBLE_DEVICES=0 python eval.py

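Or you can write a 'for' loop in a bash script, not necessarily in a Python script. Note that I used CUDA_VISIBLE_DEVICES=0 at the beginning of the command (the index can go up to 7 if you have 8 GPUs in one machine). In my experience, TensorFlow uses all the GPUs in a machine if I don't specify which GPU the operations should use, with code like this: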

with tf.device('/gpu:0'):

If you want to try a multi-GPU implementation, there are some examples.

Hope this helps.

Answer 3

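You probably don't want to do this.

If you run thousands and thousands of models on your data and pick the one that evaluates best, you are not doing machine learning; instead you are memorizing your data set, and there is no guarantee that the model you pick will perform at all outside that data set.

In other words, that approach is similar to having a single model with thousands of degrees of freedom. Having a model of such high complexity is problematic, since it will be able to fit your data better than is actually warranted; such a model is annoyingly able to memorize any noise (outliers, measurement errors, and such) in your training data, which causes the model to perform poorly when the noise is even slightly different.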
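The effect can be illustrated with a toy simulation, which is a sketch under artificial assumptions rather than anything from the thread. Every "model" here is a coin flipper with a true accuracy of 50%, the labels are random, and the sizes are made up. Picking the best of many such models on a small validation set still produces a flattering score.

```python
import random


def accuracy(predictions, labels):
    """Fraction of positions where prediction and label agree."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def random_model(rng, n):
    """A 'model' that guesses 0/1 at random, so its true accuracy is 50%."""
    return [rng.randint(0, 1) for _ in range(n)]


rng = random.Random(0)
n_val, n_test, n_models = 50, 5000, 1000
val_labels = [rng.randint(0, 1) for _ in range(n_val)]
test_labels = [rng.randint(0, 1) for _ in range(n_test)]

# "Hyperparameter search": keep the best validation accuracy over all models.
best_val = max(accuracy(random_model(rng, n_val), val_labels) for _ in range(n_models))
# A fresh random guesser stands in for the selected model on unseen data.
test_acc = accuracy(random_model(rng, n_test), test_labels)

print("best validation accuracy: %.2f" % best_val)
print("accuracy on unseen data:  %.2f" % test_acc)
```

A common guard is to keep a test set that plays no part in the selection and to report the chosen model's score on that set only.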

(Apologies for posting this as an answer; the site wouldn't let me add a comment.)

Answer 4

A simple solution: give each model a unique session and graph.

It works on this platform: TensorFlow 1.12.0, Keras 2.1.6-tf, Python 3.6.7, Jupyter Notebook.

Key code:

with session.as_default():
    with session.graph.as_default():
        # do something about an ANN model

Full code:

import tensorflow as tf
from tensorflow import keras
import gc

def limit_memory():
    """ Release unused memory resources. Force garbage collection """
    keras.backend.clear_session()
    keras.backend.get_session().close()
    tf.reset_default_graph()
    gc.collect()
    #cfg = tf.ConfigProto()
    #cfg.gpu_options.allow_growth = True
    #keras.backend.set_session(tf.Session(config=cfg))
    keras.backend.set_session(tf.Session())
    gc.collect()


def create_and_train_ANN_model(hyper_parameter):
    print('create and train my ANN model')
    info = { 'result about this ANN model' }
    return info

for i in range(10):
    limit_memory()        
    session = tf.Session()
    keras.backend.set_session(session)
    with session.as_default():
        with session.graph.as_default():   
            hyper_parameter = { 'A set of hyper-parameters' }  
            info = create_and_train_ANN_model(hyper_parameter)      
    limit_memory()

Inspired by this link: "Keras (Tensorflow backend) Error - Tensor input_1:0, specified in either feed_devices or fetch_devices was not found in the Graph".

Answer 5

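I had the same issue. My solution is to run the following from another script, as many times and with as many hyperparameter configurations as you want.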

import os

cmd = "python3 ./model_train.py hyperparameters"
os.system(cmd)
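The same idea can be written as a loop over configurations with the `subprocess` module. This is a sketch under assumptions: `build_command` and `run_all` are hypothetical helpers, and it assumes model_train.py parses `--flag=value` arguments, which the answer does not specify.

```python
import subprocess
import sys


def build_command(script, hyperparams):
    """Turn {'lr': 0.01} into [python, script, '--lr=0.01'] (no shell needed)."""
    flags = ["--{}={}".format(k, v) for k, v in sorted(hyperparams.items())]
    return [sys.executable, script] + flags


def run_all(script, configs, runner=subprocess.call):
    """Run one training process per config, one after another."""
    exit_codes = []
    for hyperparams in configs:
        # Each call blocks until the child exits, so the child's GPU memory
        # is released before the next model starts.
        exit_codes.append(runner(build_command(script, hyperparams)))
    return exit_codes
```

For example, `run_all("./model_train.py", [{"lr": 0.1}, {"lr": 0.01}])` launches two runs in sequence and returns their exit codes, which makes it easy to spot a configuration that crashed.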


Solution 1
16 Accepted 2017-03-07 12:11:36

Solution 2
2 2017-03-07 14:27:52

Solution 3
0 2017-03-11 11:28:47

Solution 4
0 2019-05-09 01:15:28

Solution 5
0 2023-01-31 16:07:56
