
Why do I have memory leaks when loading batches for training using Tensorflow?

I am training a neural network using batches of images. I want to import the images inside the training for-loop to avoid loading them all at once. Once a loop iteration is done, I want to use a new batch and forget about the previous images to free CPU memory. I am reallocating the variables every time, but the memory (and thus the running time) keeps on increasing. Do you know how to free the memory used by the previous batches?

I am using Python 3.6.8 and Tensorflow 1.14.0 on a Tesla K80 GPU (memory_limit: 11.3 GB).

I have tried gc.collect() but it does not work.

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import time
import gc
import psutil

dir_img = "../dir_img_png/data/"
data = [os.path.join(dir_img, f) for f in os.listdir(dir_img) if os.path.isfile(os.path.join(dir_img, f))]
np.random.shuffle(data)
data_train = data[10:]
data_test = data[:10]

# Hyperparameters
input_size=256
batch_size=64
epochs = 20

def memory():
    pid = os.getpid()
    py = psutil.Process(pid)
    memoryUse = py.memory_info()[0]/2.**30  # memory use in GB
    print('memory use:', memoryUse)

def preprocess_image(path):
    raw_img = tf.read_file(path)
    img = tf.io.decode_png(raw_img, channels=1)
    img = tf.cast(img, tf.float32)
    img -= 127.5
    img *= 1./127.5
    return img

# Networks IO
real_images = tf.placeholder(tf.float32, [None, input_size, input_size, 1])

init = tf.global_variables_initializer()

sess = tf.Session()
sess.run(init)

for epoch in range(epochs):
    for batch in range(len(data_train)//batch_size):
        tps = time.time()
        # NOTE: preprocess_image(d) adds new ops to the default graph on
        # every call, so the graph grows each iteration (see answer below)
        batch_images = np.array([sess.run(preprocess_image(d)) for d in data_train[batch*batch_size:batch*batch_size+batch_size]])
        print("tps1: {t}".format(t=time.time()-tps))

        tps2 = time.time()
        gc.collect()
        print("tps2: {t}".format(t=time.time()-tps2))

        memory()

sess.close()

Here is the output I get:

tps1: 10.445663928985596
tps2: 0.0995786190032959
memory use: 0.871917724609375
tps1: 9.142687320709229
tps2: 0.10767912864685059
memory use: 0.9062271118164062
tps1: 12.030094146728516
tps2: 0.10630679130554199
memory use: 0.9415740966796875
tps1: 13.415296077728271
tps2: 0.11185669898986816
memory use: 0.9608650207519531
tps1: 12.053950548171997
tps2: 0.11706066131591797
memory use: 0.9794692993164062
tps1: 14.279714584350586
tps2: 0.11610865592956543
memory use: 0.9980583190917969
tps1: 11.772900342941284
tps2: 0.12384176254272461
memory use: 1.0166587829589844
tps1: 15.43606686592102
tps2: 0.12571096420288086

The memory and running time keep on increasing.
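
A quick way to confirm that it is the TensorFlow graph itself that keeps growing is to count its operations at each iteration; a minimal sketch, assuming TF 1.x, to add inside the batch loop:

# Count the nodes in the default graph; if this number grows
# across iterations, new ops are being created inside the loop.
print('graph ops:', len(tf.get_default_graph().get_operations()))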

Every tf.* call (like tf.read_file, tf.io.decode_png or tf.cast) must be outside of the loop. Each of these calls appends new operations to the default graph, so calling them inside the loop makes the graph grow on every iteration; gc.collect() has no effect because the graph still holds references to all those nodes.

We need to define the graph before the loop and then run only the tensor we want:

import numpy as np
import tensorflow as tf
import os
import time
import psutil

dir_img = "../dir_img_png/data/"
data = [os.path.join(dir_img, f) for f in os.listdir(dir_img) if os.path.isfile(os.path.join(dir_img, f))]
np.random.shuffle(data)
data_train = data[10:]
data_test = data[:10]

# Hyperparameters
input_size=256
batch_size=64
epochs = 20

def memory():
    pid = os.getpid()
    py = psutil.Process(pid)
    memoryUse = py.memory_info()[0]/2.**30  # memory use in GB
    print('memory use:', memoryUse)

# Networks IO
img_path = tf.placeholder(tf.string)
real_images = tf.placeholder(tf.float32, [None, input_size, input_size, 1])

# Preprocess image: these ops are added to the graph exactly once
raw_img = tf.read_file(img_path)
png_img = tf.io.decode_png(raw_img, channels=1)
png_img = tf.cast(png_img, tf.float32)
png_img = tf.subtract(png_img, 127.5)
png_img = tf.divide(png_img, 127.5)

init = tf.global_variables_initializer()

sess = tf.Session()
sess.run(init)

for epoch in range(epochs):
    for batch in range(len(data_train)//batch_size):
        tps = time.time()
        # Feed a file path and run the pre-built png_img tensor; no new ops are created
        batch_images = np.array([sess.run(png_img, feed_dict={img_path: d}) for d in data_train[batch*batch_size:batch*batch_size+batch_size]])
        print("tps1: {t}".format(t=time.time()-tps))

        memory()

sess.close()
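
In a full training script, each batch_images array would then be fed into the real_images placeholder inside the loop, e.g. sess.run(train_op, feed_dict={real_images: batch_images}), where train_op is a hypothetical training op defined on the same graph, not part of the code above.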

The output then shows that the memory usage is constant:

tps1: 11.155524492263794
memory use: 0.8702735900878906
tps1: 9.429716110229492
memory use: 0.8859291076660156
tps1: 9.62732195854187
memory use: 0.8859291076660156
tps1: 11.327840089797974
memory use: 0.8859291076660156
tps1: 8.580215215682983
memory use: 0.8859291076660156
tps1: 11.039035081863403
memory use: 0.8859291076660156
tps1: 12.866767168045044
memory use: 0.8859291076660156
tps1: 12.425249576568604
memory use: 0.8859291076660156
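
As a side note (an addition, not part of the original answer), the same pattern can be expressed with the tf.data API available in TF 1.14, which builds the decoding ops once and handles batching and epochs for you. A minimal sketch, assuming the same data_train list of 256x256 grayscale PNG paths:

def load_png(path):
    # Decode one PNG and scale pixels to [-1, 1]; these ops are built once.
    img = tf.io.decode_png(tf.read_file(path), channels=1)
    img = tf.cast(img, tf.float32)
    return (img - 127.5) / 127.5

dataset = tf.data.Dataset.from_tensor_slices(data_train)
dataset = dataset.map(load_png).batch(batch_size).repeat(epochs)
next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    while True:
        try:
            batch_images = sess.run(next_batch)  # one batch per call
        except tf.errors.OutOfRangeError:
            break  # all epochs have been consumed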
