How to preprocess a huge dataset and save it such that I can train the data in Python

I want to preprocess a huge dataset (600k images) to be used to train a model. However, it is taking too much memory, and none of the solutions I have found fits my problem. Here is part of my code. I'm still new to deep learning and I think I did a bad job of preprocessing the data. If anyone knows how to solve this memory issue, it would be greatly appreciated.

# Imports needed by the snippet below
import random
import numpy as np
import pandas as pd
import keras
from keras.preprocessing.image import load_img, img_to_array

# Read the CSV file
data_frame = pd.read_csv("D:\\Downloads\\ndsc-beginner\\train.csv")

# Load an image as a grayscale numpy array
def load_image(img_path, target_size=(256, 256)):
    # Append the .jpg extension if the path does not already end with it
    if img_path[-4:] != '.jpg':
        img_path += '.jpg'
    img = load_img(img_path, target_size=target_size, grayscale=True)
    # Convert to a numpy array
    return img_to_array(img)


IMG_SIZE = 256
image_arr = []
# Get the category column values
category_id = data_frame['Category']
# One-hot encode the categories - there are 50 of them
dummy_cat_id = keras.utils.np_utils.to_categorical(category_id, 50)
# Get the image-path column values (take every row so the paths stay
# aligned with the one-hot labels above)
path_list = data_frame.iloc[:, -1]

# Yield successive batches of the data
def batch_gen(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i+batch_size]

# Append each [image array, one-hot label] pair to image_arr
def extract_data(data_frame):
    batch_size = 1000
    index = 0
    for batch in batch_gen(path_list, batch_size):
        for mini_path in batch:
            image_arr.append([load_image(mini_path), dummy_cat_id[index]])
            index += 1

extract_data(data_frame)
# Shuffle so training examples are not ordered by category
random.shuffle(image_arr)


# Features and labels for the training data
trainImages = np.array([i[0] for i in image_arr]
                       ).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
trainLabels = np.array([i[1] for i in image_arr])

# Normalize pixel values to the [0, 1] range
trainImages = trainImages.astype('float32')
trainImages /= 255.0

I see that in preprocessing you are just converting the images to grayscale and normalizing them. If you are using Keras, you can use the following, which normalizes the images and converts them to grayscale in one step. Make sure you give it a path that contains one folder per class, with the images inside those folders. You can change class_mode to 'categorical' if you want.

from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1] as batches are generated
train_datagen = ImageDataGenerator(rescale=1./255)
train_gen = train_datagen.flow_from_directory(
        f'{readPath}/training/',   # readPath is the base directory holding the class folders
        target_size=(100, 100),
        color_mode='grayscale',
        batch_size=32,
        classes=['cat', 'dog'],
        class_mode='binary'
    )

To train you can use model.fit_generator() function要训​​练,您可以使用 model.fit_generator() 函数
