Keras/Tensorflow: Train multiple models on the same GPU in a loop or using Process

I have multiple models to train in Keras/Tensorflow, one after the other, without manually calling train.py for each one, so I did:

for i in range(0, max_count):
    model = get_model(i)   # returns ith model
    model.fit(...)
    model.save(...)

It runs fine for i=0 (and in fact runs perfectly when run separately). The problem is that when the second model is loaded, I get ResourceExhaustedError: OOM, so I tried to release memory at the end of the for loop:

# at the top of train.py: import gc, keras, tensorflow as tf
del model
keras.backend.clear_session()
tf.keras.backend.clear_session()  # tf.clear_session() does not exist; this is the TF-side equivalent
tf.reset_default_graph()
gc.collect()

none of which works, individually or in combination.

I looked it up further and found that the only way to release GPU memory is to end the process.

Also, from this Keras issue:

Update (2018/08/01): Currently only TensorFlow backend supports proper cleaning up of the session. This can be done by calling K.clear_session(). This will remove EVERYTHING from memory (models, optimizer objects and anything that has tensors internally). So there is no way to remove a specific stale model. This is not a bug of Keras but a limitation of the backends.

So clearly the way to go is to create a new process every time I load a model, wait for it to end, and then train the next model in a fresh process, like this:

import multiprocessing

def train_model_in_new_process(model_module, kfold_object, x, y, x_test, y_test, epochs, model_file_count):
    training_process = multiprocessing.Process(target=train_model, args=(x, y, x_test, y_test, epochs, model_file_count, ))
    training_process.start()
    training_process.join()

but then it throws this error:

  File "train.py", line 110, in train_model_in_new_process
    training_process.start()
  File "F:\Python\envs\tensorflow\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle module objects
Using TensorFlow backend.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "F:\Python\envs\tensorflow\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input

I really can't use the information in the error to see what I was doing wrong. It clearly points at the line training_process.start(), but I can't seem to understand what's causing the problem.

Any help to train models, either using a for loop or using Process, is appreciated.

Apparently, multiprocessing doesn't like modules, or more precisely, importlib modules. I was loading models from numbered .py files using importlib:

model_module = importlib.import_module(model_file)

and hence the trouble.
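Presumably the imported module object was ending up in the Process arguments. A reconstruction of the failing pattern (hypothetical; my exact original call isn't shown above, and the variables are the ones from the earlier snippet):

import importlib
import multiprocessing

model_module = importlib.import_module(model_file)

# On Windows, multiprocessing uses the "spawn" start method: the Process
# target and args are pickled and sent to the child process, and module
# objects can't be pickled.
training_process = multiprocessing.Process(
    target=train_model,
    args=(model_module, x, y, x_test, y_test, epochs, model_file_count))
training_process.start()  # TypeError: can't pickle module objects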

I did the same inside the Process and it was all fine :)
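For reference, here's a minimal sketch of the pattern that worked: pass only the module name (a plain, picklable string) into the child process and call importlib.import_module inside it. The get_model() call and the "model_%d" file names are placeholders for whatever your numbered modules actually expose, and x, y, epochs, max_count, etc. are the variables from the earlier snippets:

import importlib
import multiprocessing

def train_model(model_file, x, y, x_test, y_test, epochs, model_file_count):
    # Import the numbered module *inside* the child process, so only the
    # module name string crosses the process boundary.
    model_module = importlib.import_module(model_file)
    model = model_module.get_model()  # placeholder for however the module builds its model
    model.fit(x, y, validation_data=(x_test, y_test), epochs=epochs)
    model.save("model_%d.h5" % model_file_count)

def train_model_in_new_process(model_file, x, y, x_test, y_test, epochs, model_file_count):
    training_process = multiprocessing.Process(
        target=train_model,
        args=(model_file, x, y, x_test, y_test, epochs, model_file_count))
    training_process.start()
    training_process.join()  # block until done, so GPU memory is released before the next model

if __name__ == "__main__":  # required on Windows, where child processes are spawned, not forked
    for i in range(0, max_count):
        train_model_in_new_process("model_%d" % i, x, y, x_test, y_test, epochs, i)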

But I still could NOT find a way to do this without Processes, using plain for loops. If you have an answer, please post it here; you're most welcome to. But anyway, I'm continuing with processes, because I believe processes are cleaner: each one is isolated, and all the memory allocated for it is released when it's done.
