
Clearing memory when training machine learning models with TensorFlow 1.15 on GPU

I am training a fairly intensive ML model on a GPU. What often happens is that I start training the model, let it run for a couple of epochs, notice that my changes have not made a significant difference in the loss/accuracy, make edits, re-initialize the model, and restart training from epoch 0. In this case, I often get OOM errors.

My guess is that, despite my overwriting all the model variables, something is still taking up space in GPU memory.

Is there a way to clear the GPU's memory in TensorFlow 1.15 so that I don't have to restart the kernel each time I want to start training from scratch?
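
Roughly, the restart workflow looks like this (build_model and run_epoch are stand-ins for my own code, not real APIs):

    import tensorflow as tf  # TF 1.15, graph/session style

    def restart_training(build_model, run_epoch, num_epochs):
        # Start from a clean graph so the rebuilt model doesn't pick up
        # ops left over from the previous run.
        tf.reset_default_graph()
        with tf.Session() as sess:
            train_op, loss = build_model()               # stand-in: rebuilds the model
            sess.run(tf.global_variables_initializer())  # re-initialize from scratch
            for epoch in range(num_epochs):
                run_epoch(sess, train_op, loss)          # stand-in: one training epoch

Even with the graph reset and a fresh session, the restarted run frequently hits OOM.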

It depends on exactly which GPUs you're using. I'm assuming NVIDIA, but even then, depending on the exact GPU, there are three ways to do this:

  1. nvidia-smi -r works on Tesla and other modern variants.
  2. nvidia-smi --gpu-reset works on a variety of older GPUs.
  3. Rebooting is the only option for the rest, unfortunately.
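
If a full device reset isn't an option, it can also be worth freeing TensorFlow's own allocations from inside the process before rebuilding the model. This is only a sketch, not a guaranteed fix: TF 1.15's allocator generally holds on to GPU memory for the life of the process, so this mainly clears stale graphs and Keras state rather than returning memory to the driver.

    import gc
    import tensorflow as tf  # TF 1.15

    def release_tf_state():
        # Drop Keras global state and the default graph, then force
        # Python garbage collection so unreferenced tensors can be freed.
        tf.keras.backend.clear_session()
        tf.reset_default_graph()
        gc.collect()

If that doesn't help, the nvidia-smi options above (or a reboot) are the fallback.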
