
Why do I have memory leaks when loading batches for training using Tensorflow?

I am training a neural network using batches of images. I want to import the images inside the training for-loop to avoid loading them all at once. Once a loop iteration is done, I want to use a new batch and forget about the previous images to free CPU memory. I am reallocating the variables every time, but the memory (and thus the running time) keeps on increasing. Do you know how to free the memory used by the previous batches?

I am using Python 3.6.8 and Tensorflow 1.14.0 on a Tesla K80 GPU (memory_limit: 11.3 GB).

I have tried gc.collect() but it does not work.

import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
import os
import time
import gc
import psutil

dir_img = "../dir_img_png/data/"
data = [os.path.join(dir_img, f) for f in os.listdir(dir_img) if os.path.isfile(os.path.join(dir_img, f))]
np.random.shuffle(data)
data_train = data[10:]
data_test = data[:10]

# Hyperparameters
input_size=256
batch_size=64
epochs = 20

def memory():
    pid = os.getpid()
    py = psutil.Process(pid)
    memoryUse = py.memory_info()[0]/2.**30  # memory use in GB
    print('memory use:', memoryUse)

def preprocess_image(path):
    raw_img = tf.read_file(path)
    img = tf.io.decode_png(raw_img, channels=1)
    img = tf.cast(img, tf.float32)
    img -= 127.5
    img *= 1./127.5
    return img

# Networks IO
real_images = tf.placeholder(tf.float32, [None, input_size, input_size, 1])

init = tf.global_variables_initializer()

sess = tf.Session()
sess.run(init)

for epoch in range(epochs):
    for batch in range(len(data_train)//batch_size):
        tps = time.time()
        # NOTE: preprocess_image(d) adds new ops to the default graph on
        # every call, so the graph grows each iteration (see answer below)
        batch_images = np.array([sess.run(preprocess_image(d)) for d in data_train[batch*batch_size:batch*batch_size+batch_size]])
        print("tps1: {t}".format(t=time.time()-tps))

        tps2 = time.time()
        gc.collect()
        print("tps2: {t}".format(t=time.time()-tps2))

        memory()

sess.close()

Here is the output I get:

tps1: 10.445663928985596
tps2: 0.0995786190032959
memory use: 0.871917724609375
tps1: 9.142687320709229
tps2: 0.10767912864685059
memory use: 0.9062271118164062
tps1: 12.030094146728516
tps2: 0.10630679130554199
memory use: 0.9415740966796875
tps1: 13.415296077728271
tps2: 0.11185669898986816
memory use: 0.9608650207519531
tps1: 12.053950548171997
tps2: 0.11706066131591797
memory use: 0.9794692993164062
tps1: 14.279714584350586
tps2: 0.11610865592956543
memory use: 0.9980583190917969
tps1: 11.772900342941284
tps2: 0.12384176254272461
memory use: 1.0166587829589844
tps1: 15.43606686592102
tps2: 0.12571096420288086

The memory and running time keep on increasing.
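
A quick way to confirm that it is the TensorFlow graph itself that keeps growing is to count its operations at each iteration; a minimal sketch, assuming TF 1.x, to add inside the batch loop:

# Count the nodes in the default graph; if this number grows
# across iterations, new ops are being created inside the loop.
print('graph ops:', len(tf.get_default_graph().get_operations()))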

Every tf.* call (like tf.read_file, tf.io.decode_png or tf.cast) must be outside of the loop. Each of these calls appends new operations to the default graph, so calling them inside the loop makes the graph grow on every iteration; gc.collect() has no effect because the graph still holds references to all those nodes.

We need to define the graph before the loop and then run only the tensor we want:

import numpy as np
import tensorflow as tf
import os
import time
import psutil

dir_img = "../dir_img_png/data/"
data = [os.path.join(dir_img, f) for f in os.listdir(dir_img) if os.path.isfile(os.path.join(dir_img, f))]
np.random.shuffle(data)
data_train = data[10:]
data_test = data[:10]

# Hyperparameters
input_size=256
batch_size=64
epochs = 20

def memory():
    pid = os.getpid()
    py = psutil.Process(pid)
    memoryUse = py.memory_info()[0]/2.**30  # memory use in GB
    print('memory use:', memoryUse)

# Networks IO
img_path = tf.placeholder(tf.string)
real_images = tf.placeholder(tf.float32, [None, input_size, input_size, 1])

# Preprocess image: these ops are added to the graph exactly once
raw_img = tf.read_file(img_path)
png_img = tf.io.decode_png(raw_img, channels=1)
png_img = tf.cast(png_img, tf.float32)
png_img = tf.subtract(png_img, 127.5)
png_img = tf.divide(png_img, 127.5)

init = tf.global_variables_initializer()

sess = tf.Session()
sess.run(init)

for epoch in range(epochs):
    for batch in range(len(data_train)//batch_size):
        tps = time.time()
        # Feed a file path and run the pre-built png_img tensor; no new ops are created
        batch_images = np.array([sess.run(png_img, feed_dict={img_path: d}) for d in data_train[batch*batch_size:batch*batch_size+batch_size]])
        print("tps1: {t}".format(t=time.time()-tps))

        memory()

sess.close()
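
In a full training script, each batch_images array would then be fed into the real_images placeholder inside the loop, e.g. sess.run(train_op, feed_dict={real_images: batch_images}), where train_op is a hypothetical training op defined on the same graph, not part of the code above.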

The output then shows that the memory usage is constant:

tps1: 11.155524492263794
memory use: 0.8702735900878906
tps1: 9.429716110229492
memory use: 0.8859291076660156
tps1: 9.62732195854187
memory use: 0.8859291076660156
tps1: 11.327840089797974
memory use: 0.8859291076660156
tps1: 8.580215215682983
memory use: 0.8859291076660156
tps1: 11.039035081863403
memory use: 0.8859291076660156
tps1: 12.866767168045044
memory use: 0.8859291076660156
tps1: 12.425249576568604
memory use: 0.8859291076660156
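
As a side note (an addition, not part of the original answer), the same pattern can be expressed with the tf.data API available in TF 1.14, which builds the decoding ops once and handles batching and epochs for you. A minimal sketch, assuming the same data_train list of 256x256 grayscale PNG paths:

def load_png(path):
    # Decode one PNG and scale pixels to [-1, 1]; these ops are built once.
    img = tf.io.decode_png(tf.read_file(path), channels=1)
    img = tf.cast(img, tf.float32)
    return (img - 127.5) / 127.5

dataset = tf.data.Dataset.from_tensor_slices(data_train)
dataset = dataset.map(load_png).batch(batch_size).repeat(epochs)
next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    while True:
        try:
            batch_images = sess.run(next_batch)  # one batch per call
        except tf.errors.OutOfRangeError:
            break  # all epochs have been consumed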
