I want to preprocess a huge dataset (600k images) to train a model. However, it takes too much memory, and none of the solutions I have found fits my problem. Here is part of my code. I'm still new to deep learning, and I think I did a bad job of preprocessing the data. If anyone knows how to solve this memory issue, it would be greatly appreciated.
import random

import keras
import numpy as np
import pandas as pd
from keras.preprocessing.image import img_to_array, load_img

# Read the CSV file
data_frame = pd.read_csv("D:\\Downloads\\ndsc-beginner\\train.csv")

# Load one image as a grayscale numpy array
def load_image(img_path, target_size=(256, 256)):
    # Append the extension if the path does not already end with .jpg
    if not img_path.endswith('.jpg'):
        img_path += '.jpg'
    # color_mode='grayscale' replaces the deprecated grayscale=True
    img = load_img(img_path, target_size=target_size, color_mode='grayscale')
    # Convert to a numpy array
    return img_to_array(img)
IMG_SIZE = 256
image_arr = []
# Get the category column values
category_id = data_frame['Category']
# Change the category to one-hot - has 50 categories
dummy_cat_id = keras.utils.to_categorical(category_id, 50)
# Get the image paths column values (all rows, so paths stay aligned with the labels)
path_list = data_frame.iloc[:, -1]
# Batch generator
def batch_gen(data, batch_size):
    for i in range(0, len(data), batch_size):
        yield data[i:i+batch_size]
# Append the numpy array (img) and category label into an array.
def extract_data(data_frame):
    total_count = len(path_list)
    batch_size = 1000
    index = 0
    for path in batch_gen(path_list, batch_size):
        for mini_path in path:
            image_arr.append([load_image(mini_path), dummy_cat_id[index]])
            print(index)
            index += 1

#extract_data(data_frame)
random.shuffle(image_arr)
# Features and Labels for training data
trainImages = np.array([i[0] for i in image_arr]).reshape(-1, IMG_SIZE, IMG_SIZE, 1)
trainLabels = np.array([i[1] for i in image_arr])
trainImages = trainImages.astype('float32')
trainImages /= 255.0
It looks like your preprocessing only converts the images to grayscale and normalizes them. If you are using Keras, ImageDataGenerator can do both, and because it reads images from disk one batch at a time, the full dataset never has to fit in memory. Use the following, making sure the path you pass points to a directory that contains one subfolder per class, with the images inside those subfolders. With more than two classes (you have 50), change class_mode to 'categorical' and adjust or drop the classes list.
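from keras.preprocessing.image import ImageDataGenerator

# Rescale pixel values to [0, 1] as the images are loaded
train_datagen = ImageDataGenerator(rescale=1./255)
train_gen = train_datagen.flow_from_directory(
    f'{readPath}/training/',    # readPath: root folder of your dataset
    target_size=(100, 100),
    color_mode='grayscale',     # convert to grayscale on load
    batch_size=32,
    classes=['cat', 'dog'],
    class_mode='binary'
)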
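If your images are not sorted into class folders and the labels live in a CSV, as in the question, ImageDataGenerator also has flow_from_dataframe. Below is a rough sketch, not a tested solution for your data. It assumes a Keras version that provides flow_from_dataframe, that the label column is named 'Category', and that the image paths are in the last column, as your code suggests. The helper names with_jpg_extension and make_train_generator are made up for illustration.

```python
def with_jpg_extension(path):
    """Append '.jpg' when the path does not already end with it."""
    return path if path.endswith('.jpg') else path + '.jpg'


def make_train_generator(csv_path, image_dir=None, batch_size=32):
    """Build a generator that reads images from disk one batch at a time."""
    # Imported here so the helper above stays usable without pandas/Keras.
    import pandas as pd
    from keras.preprocessing.image import ImageDataGenerator

    df = pd.read_csv(csv_path)
    path_col = df.columns[-1]                     # assumption: paths are in the last column
    df[path_col] = df[path_col].map(with_jpg_extension)
    df['Category'] = df['Category'].astype(str)   # categorical mode expects string labels

    datagen = ImageDataGenerator(rescale=1. / 255)
    return datagen.flow_from_dataframe(
        df,
        directory=image_dir,        # if None, the paths in the CSV should be absolute
        x_col=path_col,
        y_col='Category',
        target_size=(256, 256),
        color_mode='grayscale',
        class_mode='categorical',   # one-hot labels, one column per category
        batch_size=batch_size,
    )
```

One thing to watch: because the labels are passed as strings, Keras assigns class indices in sorted string order ('0', '1', '10', ...), so check the generator's class_indices attribute before mapping predictions back to category IDs.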
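To train, pass the generator to model.fit_generator(). In recent TensorFlow/Keras releases, model.fit() accepts generators directly and fit_generator() is deprecated.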
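For a sense of why the generator matters here, this back-of-the-envelope estimate compares holding all 600k images in one array with holding a single batch. It assumes 256x256 grayscale images stored as float32 (4 bytes per pixel), which is what img_to_array returns by default. The function name array_bytes is just for illustration.

```python
# Rough memory estimate: all images in RAM versus one batch at a time.
# Assumes float32 pixels (4 bytes each) and a single grayscale channel.

def array_bytes(n_images, height, width, channels=1, bytes_per_value=4):
    """Size in bytes of an (n_images, height, width, channels) array."""
    return n_images * height * width * channels * bytes_per_value

GIB = 1024 ** 3
MIB = 1024 ** 2

full_dataset = array_bytes(600_000, 256, 256)   # everything loaded at once
one_batch = array_bytes(32, 256, 256)           # a single batch of 32 images

print(f"full dataset: {full_dataset / GIB:.1f} GiB")   # 146.5 GiB
print(f"one batch:    {one_batch / MIB:.1f} MiB")      # 8.0 MiB
```

So the full array alone would need roughly 146 GiB, before counting the Python list of arrays built along the way, while a batch of 32 needs about 8 MiB.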