Why does my kernel die every time I run train-test split on this particular dataset?
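I have used train-test split before without any problems. I have a fairly large (1 GB) dataset for my CNN and tried to use it, but my kernel dies every time. I have read that passing shuffle=False sometimes helps. I tried it, but no luck. I have included my code below. Any help would be greatly appreciated!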
import numpy as np  # needed for np.random.seed and the array operations below
import pandas as pd
import os
import cv2
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from PIL import Image
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import accuracy_score
np.random.seed(42)
data_dir='birds/'
train_path=data_dir+'/train'
test_path=data_dir+'/test'
img_size=(100,100)
channels=3
num_categories=len(os.listdir(train_path))
#get list of each category to zip
names_of_species=[]
for i in os.listdir(train_path):
    names_of_species.append(i)
#make list of numbers from 0-299:
num_list=[]
for i in range(300):
    num_list.append(i)
nums_and_names=dict(zip(num_list, names_of_species))
folders=os.listdir(train_path)
import random
from matplotlib.image import imread
df=pd.read_csv(data_dir+'/Bird_Species.csv')
img_data=[]
img_labels=[]
for i in nums_and_names:
    path=data_dir+'train/'+str(names_of_species[i])
    images=os.listdir(path)
    for img in images:
        try:
            image=cv2.imread(path+'/'+img)
            image_fromarray=Image.fromarray(image, 'RGB')
            resize_image=image_fromarray.resize((img_size))
            img_data.append(np.array(resize_image))
            img_labels.append(num_list[i])
        except:
            print("Error in "+img)
img_data=np.array(img_data)
img_labels=np.array(img_labels)
img_labels
array([210, 41, 148, ..., 15, 115, 292])
#SHUFFLE TRAINING DATA
shuffle_indices=np.arange(img_data.shape[0])
np.random.shuffle(shuffle_indices)
img_data=img_data[shuffle_indices]
img_labels=img_labels[shuffle_indices]
#Split the data
X_train, X_test, y_train, y_test=train_test_split(img_data,img_labels, test_size=0.2,random_state=42, shuffle=False)
#Scale pixel values to [0, 1]
X_train=X_train/255
X_test=X_test/255
A kernel that dies like this probably means you have run out of RAM or GPU memory.
To check on Windows, open Task Manager (Ctrl+Shift+Esc), go to the Performance tab, run the code, and watch the RAM usage and GPU memory usage to see whether either one is the cause.
Note: to monitor GPU memory, watch "Dedicated GPU memory", which appears at the bottom left when you click GPU.
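To get a rough idea of how much RAM the arrays in the question need before running anything, you can estimate it from the shape and dtype. The sketch below is only an illustration: the image count is a made-up placeholder (use img_data.shape[0] from your own run) and array_size_gb is a hypothetical helper. It also shows that dividing a uint8 array by 255 produces a float64 array, which takes 8 times the memory of the original.

```python
import numpy as np

def array_size_gb(n_images, height, width, channels, dtype):
    """Return the size in GiB of a dense image array with this shape and dtype."""
    n_bytes = n_images * height * width * channels * np.dtype(dtype).itemsize
    return n_bytes / 1024**3

# Hypothetical image count; substitute img_data.shape[0] from your own run.
n_images = 45_000

for dtype in (np.uint8, np.float32, np.float64):
    size = array_size_gb(n_images, 100, 100, 3, dtype)
    print(np.dtype(dtype).name, round(size, 2), "GiB")

# Dividing a uint8 array by a Python int promotes the result to float64.
sample = np.zeros((2, 100, 100, 3), dtype=np.uint8)
scaled = sample / 255
print(scaled.dtype, scaled.nbytes // sample.nbytes)
```

If the float64 figure is close to or above your installed RAM, converting with astype(np.float32) or scaling one batch at a time is one way to reduce the peak usage.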
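Adding to MK's answer: if your kernel really is crashing because of RAM/GPU limits, you can try loading the data in batches. Instead of splitting the whole dataset at once, try splitting a quarter at a time.
Note that after splitting you are effectively holding two instances of the same data (the original (img_data, img_labels) and the split copies). If you are running out of memory, the best approach is to manage the data through an index array, from which you can pull batches as needed.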
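Create a shuffled index array:
shuffle_indices = np.random.permutation(img_data.shape[0])
This does in one step what the two lines in the question (np.arange followed by np.random.shuffle) do.
Split the indices into those for the training and test sets:
train_indices, test_indices = train_test_split(shuffle_indices, test_size=0.2, random_state=42, shuffle=False)
Then iterate over batches: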
n_train = len(train_indices)
for epoch in range(n_epochs):
    # further shuffle the training data for each iteration, if desired
    epoch_shuffle = np.random.permutation(n_train)
    for start in range(0, n_train, batch_size):
        # get data batches
        batch_indices = train_indices[epoch_shuffle[start : start + batch_size]]
        x_batch = img_data[batch_indices]
        y_batch = img_labels[batch_indices]
        # train model
        ...
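For reference, here is a self-contained sketch of the same idea that runs on small synthetic data. Several things here are assumptions for illustration only: iter_batches is a hypothetical helper, the random arrays stand in for img_data and img_labels, and the split is done with a plain NumPy permutation instead of train_test_split to keep the example dependency-free. Scaling is done per batch in float32, so the full dataset stays in uint8.

```python
import numpy as np

def iter_batches(data, labels, indices, batch_size, rng):
    """Yield (x, y) batches drawn from `indices` in a fresh random order."""
    order = rng.permutation(len(indices))
    for start in range(0, len(indices), batch_size):
        batch_idx = indices[order[start:start + batch_size]]
        # Convert and scale per batch so only one small float array exists at a time.
        yield data[batch_idx].astype(np.float32) / 255.0, labels[batch_idx]

# Small synthetic stand-ins for img_data and img_labels.
rng = np.random.default_rng(42)
img_data = rng.integers(0, 256, size=(50, 100, 100, 3), dtype=np.uint8)
img_labels = rng.integers(0, 300, size=50)

# 80/20 split on the indices only; the image array itself is never copied.
perm = rng.permutation(img_data.shape[0])
n_test = int(0.2 * len(perm))
test_indices, train_indices = perm[:n_test], perm[n_test:]

batches = list(iter_batches(img_data, img_labels, train_indices, batch_size=16, rng=rng))
print(len(batches), batches[0][0].shape, batches[0][0].dtype)
```

In a real training loop you would call iter_batches once per epoch and pass each batch to the model (for example with model.train_on_batch in Keras) rather than collecting the batches in a list.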
I ran into the same problem when using
from sklearnex import patch_sklearn
patch_sklearn()
It would always crash at random points in the code, especially after train_test_split.