Keras/Tensorflow: Train multiple models on the same GPU in a loop or using Process
I have multiple models to train in Keras/Tensorflow one after the other, without manually calling train.py for each, so I did:
for i in range(0, max_count):
    model = get_model(i)  # returns the i-th model
    model.fit(...)
    model.save(...)
It runs fine for i=0 (and in fact runs perfectly when run separately). The problem is that when the second model is loaded, I get a ResourceExhaustedError (OOM), so I tried to release memory at the end of the for loop:
del model
keras.backend.clear_session()
tf.clear_session()
tf.reset_default_graph()
gc.collect()
none of which works, individually or collectively.
I looked further and found that the only way to release GPU memory is to end the process.
Also, from this Keras issue:
Update (2018/08/01): Currently only TensorFlow backend supports proper cleaning up of the session. This can be done by calling K.clear_session(). This will remove EVERYTHING from memory (models, optimizer objects and anything that has tensors internally). So there is no way to remove a specific stale model. This is not a bug of Keras but a limitation of the backends.
So clearly the way to go is to create a new process every time I load a model, wait for it to end, and then train the next model in another fresh process, like this:
import multiprocessing

def train_model_in_new_process(model_module, kfold_object, x, y, x_test, y_test, epochs, model_file_count):
    training_process = multiprocessing.Process(target=train_model, args=(x, y, x_test, y_test, epochs, model_file_count, ))
    training_process.start()
    training_process.join()
but then it throws this error:
  File "train.py", line 110, in train_model_in_new_process
    training_process.start()
  File "F:\Python\envs\tensorflow\lib\multiprocessing\process.py", line 105, in start
    self._popen = self._Popen(self)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\popen_spawn_win32.py", line 65, in __init__
    reduction.dump(process_obj, to_child)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle module objects
Using TensorFlow backend.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "F:\Python\envs\tensorflow\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "F:\Python\envs\tensorflow\lib\multiprocessing\spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
I really can't use the information in this error to see what I was doing wrong. It clearly points at the line training_process.start(), but I can't seem to understand what's causing the problem.
Any help to train models, either using a for loop or using Process, is appreciated.
Apparently, multiprocessing doesn't like modules, or more precisely, importlib modules. I was loading models from numbered .py files using importlib:

model_module = importlib.import_module(model_file)

and hence the trouble.
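The failure is easy to reproduce outside Keras entirely. Here is a minimal sketch (using the stdlib json module as a stand-in for one of my numbered model modules): the module object itself can't be pickled, but its name, a plain string, round-trips fine.

```python
import importlib
import pickle

# Stand-in for one of the numbered model modules loaded via importlib:
mod = importlib.import_module("json")

# multiprocessing (with the spawn start method, the default on Windows)
# pickles every Process argument to send it to the child -- and module
# objects cannot be pickled:
try:
    pickle.dumps(mod)
    module_is_picklable = True
except TypeError:
    module_is_picklable = False

assert module_is_picklable is False

# The module *name* is just a string, so it pickles fine; the child
# process can re-import it with importlib.import_module after it starts:
name = pickle.loads(pickle.dumps("json"))
assert name == "json"
```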
I did the same importlib.import_module call inside the Process instead, and it was all fine :)
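In other words: pass the module name (a string) to the child and do the import there. A minimal sketch of that pattern — the train_model body here is a placeholder, and the function names mirror my code above rather than any library API; the real version would build, fit and save the Keras model:

```python
import importlib
import multiprocessing

def train_model(model_file):
    # The import happens *inside* the child process, so only the module
    # name (a plain, picklable string) crosses the process boundary.
    model_module = importlib.import_module(model_file)
    # Placeholder: the real version would call model_module.get_model(),
    # model.fit(...) and model.save(...). All GPU memory the child
    # allocates is released when the child process exits.

def train_model_in_new_process(model_file):
    training_process = multiprocessing.Process(target=train_model, args=(model_file,))
    training_process.start()
    training_process.join()
    return training_process.exitcode  # 0 means the child finished cleanly
```

Calling train_model_in_new_process once per numbered model file then gives every model a fresh process, and the OOM from stale models goes away.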
But I still could NOT find a way to do this without Processes, using just for loops. If you have an answer, please post it here, you're welcome. But anyway, I'm continuing with processes, because processes are, I believe, cleaner in the sense that they are isolated, and all the memory allocated to a specific process is cleared when it's done.