How to preprocess a huge dataset and save it such that I can train the data in Python

I want to preprocess a huge dataset (600k images) to be used to train a model. However, it is taking too much memory, and none of the solutions I have found fits my problem. Here is part of my code. I'm still new to deep learning and I think I did a bad job of preprocessing the data. If anyone knows how to solve this memory issue, it would be greatly appreciated.

# Imports needed by the snippet below
import random
import numpy as np
import pandas as pd
import keras
from keras.preprocessing.image import load_img, img_to_array

# Read the CSV file
data_frame = pd.read_csv("D:\\Downloads\\ndsc-beginner\\train.csv")

# Load an image as a grayscale numpy array
def load_image(img_path, target_size=(256, 256)):
    # Append the .jpg extension if the path does not already end with it
    if img_path[-4:] != '.jpg':
        img_path += '.jpg'
    img = load_img(img_path, target_size=target_size, grayscale=True)
    # Convert to a numpy array
    return img_to_array(img)


IMG_SIZE = 256
image_arr = []
# Get the category column values
category_id = data_frame['Category']
# One-hot encode the categories - there are 50 of them
dummy_cat_id = keras.utils.np_utils.to_categorical(category_id, 50)
# Get the image-path column values (take every row so the paths stay
# aligned with the one-hot labels above)
path_list = data_frame.iloc[:, -1]

# Yield successive batches of the data
def batch_gen(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i+batch_size]

# Append each [image array, one-hot label] pair to image_arr
def extract_data(data_frame):
    batch_size = 1000
    index = 0
    for batch in batch_gen(path_list, batch_size):
        for mini_path in batch:
            image_arr.append([load_image(mini_path), dummy_cat_id[index]])
            index += 1

extract_data(data_frame)
# Shuffle so training examples are not ordered by category
random.shuffle(image_arr)


# Features and labels for the training data
trainImages = np.array([i[0] for i in image_arr]
                       ).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
trainLabels = np.array([i[1] for i in image_arr])

# Normalize pixel values to the [0, 1] range
trainImages = trainImages.astype('float32')
trainImages /= 255.0

I see that in preprocessing you are just converting the images to grayscale and normalizing them. If you are using Keras, you can use the following, which normalizes the images and converts them to grayscale in one step. Make sure you give it a path that contains one folder per class, with the images inside those folders. You can change class_mode to 'categorical' if you want.

from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1] as batches are generated
train_datagen = ImageDataGenerator(rescale=1./255)
train_gen = train_datagen.flow_from_directory(
        f'{readPath}/training/',   # readPath is the base directory holding the class folders
        target_size=(100, 100),
        color_mode='grayscale',
        batch_size=32,
        classes=['cat', 'dog'],
        class_mode='binary'
    )

To train you can use model.fit_generator() function要训​​练,您可以使用 model.fit_generator() 函数
