
PyTorch Dataset and Conv1d using a ton of memory

I am trying to write a convolutional neural network in PyTorch. I'm very new to machine learning and PyTorch, so I'm not very familiar with the package.

I have written a custom Dataset, and it loads my data from a csv file properly. However, when I load it into a DataLoader, my system monitor shows Python using a huge amount of memory. I'm currently using only a fraction of my data set, and one DataLoader instance uses about 5 GB.

My dataset consists of 1-dimensional tensors. Each one is very long - about 33 million values. I used sys.getsizeof(train_set.sample_list[0][0].storage()) to check the size of the underlying storage, and it was only 271 megabytes.
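
One way to sanity-check that figure is to multiply element count by element size. A minimal sketch, assuming the values end up stored as 64-bit floats:

    import torch

    t = torch.zeros(33_889_258, dtype=torch.float64)  # one sample-sized tensor
    print(t.nelement() * t.element_size() / 1e6)       # ~271 MB, matching the figure above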

Additionally, if I continue on and create an instance of my CNN, the initializer eats up memory until my kernel crashes. The reason for this is unclear to me.

Here is the code for my custom Dataset:


    # Imports and the class declaration are assumed below; only the methods were posted.
    import pandas as pd
    import torch
    from torch.utils.data import Dataset

    class SampleDataset(Dataset):  # class name is assumed

        def __init__(self, csv_file, train):
            self.train = train
            # Load the whole csv and drop the trailing empty column
            self.df_tmp = pd.read_csv(csv_file, header=None, sep='\t')
            self.df_tmp.drop(self.df_tmp.shape[1] - 1, axis=1, inplace=True)
            # Transpose so that each row is one sample
            self.df = self.df_tmp.transpose()
            self.sample_list = []

            # Eagerly build a (sample, label) tensor pair for every row
            for i in range(self.df.shape[0]):  # one row per sample; each sample is ~33 million values long
                sample = torch.tensor([self.df.iloc[i][1:].values])
                label = torch.tensor(self.df.iloc[i][0])
                self.sample_list.append((sample, label))

        def __len__(self):
            return len(self.sample_list)

        def __getitem__(self, idx):
            return self.sample_list[idx]
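
The dataset is then wrapped in a DataLoader along these lines (the file path and batch size here are placeholders; the class name matches the assumed name in the sketch above):

    from torch.utils.data import DataLoader

    train_set = SampleDataset('train.csv', train=True)  # 'train.csv' is a placeholder path
    train_loader = DataLoader(train_set, batch_size=9, shuffle=True)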

And the code for the NN:


    # Imports and the class declaration are assumed below; only the methods were posted.
    import torch
    import torch.nn.functional as F

    class CNN(torch.nn.Module):

        # input batch shape is (9 x 33889258 x 1)
        def __init__(self):
            super(CNN, self).__init__()

            # input channels 1, output channels 3
            self.conv1 = torch.nn.Conv1d(1, out_channels=3, kernel_size=100, stride=10, padding=1)

            # size in is 3, 1, 33889258
            self.pool = torch.nn.MaxPool1d(kernel_size=2, stride=2, padding=0)

            self.fc1 = torch.nn.Linear(45750366, 1000)  # 3 * 1 * 3388917

            self.fc2 = torch.nn.Linear(1000, 2)

        def forward(self, x):
            # size: (1 x 1 x 33889258) to (3 x 1 x 33889258)
            tmp = self.conv1(x.float())
            x = F.relu(tmp)

            x = self.pool(x)
            # whatever shape comes out of here needs to go into x.view

            x = x.view(45750366)  # -1, 1*1*3388927
            x = self.fc1(x)
            x = F.relu(x)

            x = self.fc2(x)
            return x

Some of my input sizes might be off; I'm still working that out, but the memory issue is preventing me from making progress.
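
For reference, the Conv1d/MaxPool1d output lengths (and the memory the first fully connected layer would need) can be worked out from PyTorch's output-size formula. A rough sketch, assuming the shapes in the comments above:

    import math

    def out_len(l_in, kernel, stride, padding=0, dilation=1):
        # Output-length formula shared by Conv1d and MaxPool1d
        return math.floor((l_in + 2 * padding - dilation * (kernel - 1) - 1) / stride + 1)

    L = 33_889_258
    after_conv = out_len(L, kernel=100, stride=10, padding=1)  # 3_388_917
    after_pool = out_len(after_conv, kernel=2, stride=2)       # 1_694_458
    flat = 3 * after_pool                                      # 3 channels -> 5_083_374 features

    # float32 weight memory of the first Linear layer (bias ignored)
    print(flat * 1000 * 4 / 1e9)        # ~20 GB even with the corrected input size
    print(45_750_366 * 1000 * 4 / 1e9)  # ~183 GB with the size currently in the code

Either way, a Linear layer fed millions of input features holds billions of weights, which by itself is enough to exhaust memory as soon as the network is instantiated.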

You are storing all data points in a list (i.e. in memory), which defeats the purpose of a custom Dataset/DataLoader. You should just keep a reference to the DataFrame in your Dataset class and build the correct item for each index, something like:

    def __init__(self, csv_file, train):
        self.train = train
        self.df_tmp = pd.read_csv(csv_file, header=None, sep='\t')
        self.df_tmp.drop(self.df_tmp.shape[1] - 1, axis=1, inplace=True)
        self.df = self.df_tmp.transpose()

    def __len__(self):
        return self.df.shape[0]

    def __getitem__(self, idx):
        sample = torch.tensor([self.df.iloc[idx][1:].values])
        label = torch.tensor(self.df.iloc[idx][0])
        return sample, label

One small note: you are returning tensors from the Dataset's __getitem__ method. Returning plain NumPy arrays is preferred and easier, because the DataLoader's default collate function will convert them into PyTorch tensors for you.
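
For example, __getitem__ could return NumPy arrays directly (a sketch, assuming the label sits in column 0 and the features should be float32 for Conv1d):

    import numpy as np

    def __getitem__(self, idx):
        row = self.df.iloc[idx].values
        sample = row[1:].astype(np.float32)[None, :]  # shape (1, L): one channel for Conv1d
        label = np.int64(row[0])
        return sample, label  # the default collate_fn converts both to tensors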
