Why does my kernel die every time I run train-test split on this particular dataset?

I've used train-test split before and haven't had any issues. I have a rather large (1 GB) dataset for my CNN, and when I tried using it my kernel died every time. I've read that it sometimes helps to pass shuffle=False. I tried that with no luck. I've included my code below. Any help would be appreciated!!

import numpy as np
import pandas as pd
import os
import cv2
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from PIL import Image
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import accuracy_score
np.random.seed(42)
data_dir='birds/'
train_path=data_dir+'/train'
test_path=data_dir+'/test'
img_size=(100,100)
channels=3
num_categories=len(os.listdir(train_path))
#get list of each category to zip
names_of_species=[]

for i in os.listdir(train_path):
    names_of_species.append(i)

#make list of numbers from 0-299:
num_list=[]
for i in range(300):
    num_list.append(i)
nums_and_names=dict(zip(num_list, names_of_species))
folders=os.listdir(train_path)
import random
from matplotlib.image import imread
df=pd.read_csv(data_dir+'/Bird_Species.csv')

img_data=[]
img_labels=[]

for i in nums_and_names:
    path=data_dir+'train/'+str(names_of_species[i])
    images=os.listdir(path)
    
    for img in images:
        try:
            # note: cv2.imread returns BGR channel order; convert if true RGB matters
            image=cv2.imread(path+'/'+img)
            image_fromarray=Image.fromarray(image, 'RGB')
            resize_image=image_fromarray.resize(img_size)
            img_data.append(np.array(resize_image))
            img_labels.append(num_list[i])
        except Exception:
            print("Error in "+img)
img_data=np.array(img_data)
img_labels=np.array(img_labels)
img_labels  # array([210,  41, 148, ...,  15, 115, 292])
#SHUFFLE TRAINING DATA

shuffle_indices=np.arange(img_data.shape[0])
np.random.shuffle(shuffle_indices)
img_data=img_data[shuffle_indices]
img_labels=img_labels[shuffle_indices]
#Split the data

X_train, X_test, y_train, y_test=train_test_split(img_data,img_labels, test_size=0.2,random_state=42, shuffle=False)

#Normalize pixel values to [0, 1]
X_train=X_train/255
X_test=X_test/255

This means that you are probably running out of RAM or GPU memory.

To check on Windows, open Task Manager (Ctrl+Shift+Esc), go to the Performance tab, run the code, and check the RAM usage and GPU memory usage to determine whether either of them is the cause.

Note: To monitor GPU memory you should monitor "Dedicated GPU Memory", which can be found in the bottom left when you click on GPU.
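As a rough sanity check that the numbers plausibly overflow RAM (the image count below is illustrative, not from the question): the images are 100x100x3 uint8, train_test_split returns copies rather than views, and X_train/255 promotes uint8 to float64, an array eight times larger.

n_images = 60_000                  # illustrative count for a ~1 GB image folder
bytes_per_image = 100 * 100 * 3    # uint8: one byte per channel value

print(n_images * bytes_per_image / 1e9)      # ~1.8 GB for img_data alone
# train_test_split copies the arrays, roughly doubling peak usage
print(2 * n_images * bytes_per_image / 1e9)  # ~3.6 GB during the split
# dividing by 255 yields float64, eight bytes per value
print(n_images * bytes_per_image * 8 / 1e9)  # ~14.4 GB as float64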

Adding to MK's answer: if the cause of your kernel crash is indeed the RAM/GPU limit, you could try to load your data in batches. Instead of splitting the entire dataset at the same time, try dividing it, say, a quarter at a time.
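One way to do that with the tools already imported in the question is to stream images from disk with ImageDataGenerator.flow_from_directory instead of building one big array. A minimal sketch, assuming the birds/train folder contains one subfolder per species (the batch size and validation split here are illustrative):

from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1./255, validation_split=0.2)

# Batches are read from disk on demand, so the full dataset never sits in RAM
train_gen = datagen.flow_from_directory(
    'birds/train',
    target_size=(100, 100),
    batch_size=32,
    class_mode='sparse',   # one integer label per species subfolder
    subset='training')
val_gen = datagen.flow_from_directory(
    'birds/train',
    target_size=(100, 100),
    batch_size=32,
    class_mode='sparse',
    subset='validation')

# model.fit(train_gen, validation_data=val_gen, epochs=10)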

Notice that after splitting the data you are basically keeping two instances of the same data: the original (img_data, img_labels) arrays and the split copies. If you are running out of memory, the best approach is to manage it via an index array from which you pull batches as you need them.

Create a shuffled array of indices:

shuffle_indices = np.random.permutation(img_data.shape[0])

which does the same as your two lines in one step.

Split the indices corresponding to points in the train and test sets:

train_indices, test_indices = train_test_split(shuffle_indices, test_size=0.2, random_state=42, shuffle=False)

Then, iterate over batches:

n_train = len(train_indices)
for epoch in range(n_epochs):
    # further shuffle the training data for each iteration, if desired
    epoch_shuffle = np.random.permutation(n_train)

    for i in range(0, n_train, batch_size):
        # get data batches by index; only the batch is materialized
        batch_idx = train_indices[epoch_shuffle[i : i + batch_size]]
        x_batch = img_data[batch_idx]
        y_batch = img_labels[batch_idx]

        # train model
        ...
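If you train with Keras, the same index-array idea can be wrapped in a tf.keras.utils.Sequence so that model.fit pulls batches on demand. A sketch under the same assumptions (the class name and batch size are illustrative):

import numpy as np
import tensorflow as tf

class IndexBatchLoader(tf.keras.utils.Sequence):
    # Serves (x, y) batches through an index array, so no second
    # copy of img_data/img_labels is ever materialized.
    def __init__(self, data, labels, indices, batch_size=32):
        self.data, self.labels = data, labels
        self.indices = np.array(indices)   # copy, so shuffling stays local
        self.batch_size = batch_size

    def __len__(self):
        return int(np.ceil(len(self.indices) / self.batch_size))

    def __getitem__(self, i):
        idx = self.indices[i * self.batch_size : (i + 1) * self.batch_size]
        # normalize per batch instead of creating a float copy of everything
        return self.data[idx].astype('float32') / 255.0, self.labels[idx]

    def on_epoch_end(self):
        np.random.shuffle(self.indices)

# train_loader = IndexBatchLoader(img_data, img_labels, train_indices)
# model.fit(train_loader, epochs=10)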

I had the same problem when I used

from sklearnex import patch_sklearn
patch_sklearn()

It would always crash at random points in the code, especially after a train_test_split.
