
Clearing memory when training machine learning models with TensorFlow 1.15 on GPU

I am training a fairly intensive ML model on a GPU. What often happens is that I start training the model, let it run for a couple of epochs, notice that my changes have not made a significant difference in the loss/accuracy, make edits, re-initialize the model, and restart training from epoch 0. In this case, I often get OOM errors.

My guess is that, despite my overwriting all the model variables, something is still taking up space in GPU memory.

Is there a way to clear the GPU's memory in TensorFlow 1.15 so that I don't have to restart the kernel each time I want to start training from scratch?
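
Roughly, the restart workflow looks like this (build_model and run_epoch are stand-ins for my own code, not real APIs):

    import tensorflow as tf  # TF 1.15, graph/session style

    def restart_training(build_model, run_epoch, num_epochs):
        # Start from a clean graph so the rebuilt model doesn't pick up
        # ops left over from the previous run.
        tf.reset_default_graph()
        with tf.Session() as sess:
            train_op, loss = build_model()               # stand-in: rebuilds the model
            sess.run(tf.global_variables_initializer())  # re-initialize from scratch
            for epoch in range(num_epochs):
                run_epoch(sess, train_op, loss)          # stand-in: one training epoch

Even with the graph reset and a fresh session, the restarted run frequently hits OOM.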

It depends on exactly which GPUs you're using. I'm assuming NVIDIA, but even then, depending on the exact GPU, there are three ways to do this:

  1. nvidia-smi -r works on Tesla and other modern variants.
  2. nvidia-smi --gpu-reset works on a variety of older GPUs.
  3. Rebooting is the only option for the rest, unfortunately.
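
If a full device reset isn't an option, it can also be worth freeing TensorFlow's own allocations from inside the process before rebuilding the model. This is only a sketch, not a guaranteed fix: TF 1.15's allocator generally holds on to GPU memory for the life of the process, so this mainly clears stale graphs and Keras state rather than returning memory to the driver.

    import gc
    import tensorflow as tf  # TF 1.15

    def release_tf_state():
        # Drop Keras global state and the default graph, then force
        # Python garbage collection so unreferenced tensors can be freed.
        tf.keras.backend.clear_session()
        tf.reset_default_graph()
        gc.collect()

If that doesn't help, the nvidia-smi options above (or a reboot) are the fallback.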
