
Randomize the splitting of data for training and testing for this function

I wrote a function to split numpy ndarrays x_data and y_data into training and test data based on a percentage of the total size.

Here is the function:

def split_data_into_training_testing(x_data, y_data, percentage_split):
    number_of_samples = x_data.shape[0]
    p = int(number_of_samples * percentage_split)

    x_train = x_data[0:p]
    y_train = y_data[0:p]

    x_test = x_data[p:]
    y_test = y_data[p:]

    return x_train, y_train, x_test, y_test

In this function, the top part of the data goes to the training dataset and the bottom part goes to the testing dataset, based on percentage_split. How can this split be randomized before the data is fed to the machine learning model?

Assuming there's a reason you're implementing this yourself instead of using sklearn.model_selection.train_test_split, you can shuffle an array of indices (this leaves the original arrays untouched) and index with it.

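import numpy as np

def split_data_into_training_testing(x_data, y_data, split, shuffle=True):
    # Shuffle row indices rather than the data, so x_data and y_data are not modified.
    idx = np.arange(len(x_data))
    if shuffle:
        np.random.shuffle(idx)

    # The first p shuffled indices form the training set, the rest the test set.
    p = int(len(x_data) * split)
    x_train = x_data[idx[:p]]
    x_test = x_data[idx[p:]]
    y_train = y_data[idx[:p]]
    y_test = y_data[idx[p:]]

    # Note: this return order differs from the function in the question.
    return x_train, x_test, y_train, y_test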
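
If the split needs to be reproducible between runs, one option is to draw the permutation from a seeded generator. The following is only a minimal sketch of the shuffled-index approach above. The function name split_with_seed and the seed parameter are hypothetical and not part of the original answer, and it assumes x_data and y_data are NumPy arrays with the same number of rows.

```python
import numpy as np

def split_with_seed(x_data, y_data, split, seed=None):
    """Shuffle-split two aligned arrays; the same seed gives the same split."""
    rng = np.random.default_rng(seed)       # seeded generator (seed=None is unseeded)
    idx = rng.permutation(len(x_data))      # shuffled copy of 0..n-1
    p = int(len(x_data) * split)            # number of training samples
    train_idx, test_idx = idx[:p], idx[p:]
    return x_data[train_idx], x_data[test_idx], y_data[train_idx], y_data[test_idx]

# Toy data: row i of x is [2*i, 2*i + 1] and y[i] = i, so pairs are easy to check.
x = np.arange(20).reshape(10, 2)
y = np.arange(10)
x_train, x_test, y_train, y_test = split_with_seed(x, y, 0.8, seed=0)
```

Because the same index array is applied to both x_data and y_data, each sample stays paired with its label after shuffling.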

You can create a mask with p randomly selected True elements and index the arrays that way. I would build the selection by shuffling an array of the available indices (number_of_samples and p are computed as in the question):

ind = np.arange(number_of_samples)
np.random.shuffle(ind)
ind_train = np.sort(ind[:p])
ind_test = np.sort(ind[p:])
x_train = x_data[ind_train]
y_train = y_data[ind_train]
x_test = x_data[ind_test]
y_test = y_data[ind_test]

Sorting the indices is only necessary if your original data is monotonically increasing or decreasing in x and you'd like to keep it that way. Otherwise, ind_train = ind[:p] is just fine.
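
For comparison, the mask idea can also be written with a literal boolean mask. This is a rough sketch rather than the answer's own code: the name split_with_mask and the seed parameter are assumptions made for illustration. A boolean mask always selects rows in their original order, so no explicit sort is needed here.

```python
import numpy as np

def split_with_mask(x_data, y_data, split, seed=None):
    """Split with a boolean mask that has exactly p randomly placed True entries."""
    rng = np.random.default_rng(seed)
    n = len(x_data)
    p = int(n * split)                                   # number of training samples
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=p, replace=False)] = True    # p distinct positions
    # Boolean indexing keeps rows in their original order.
    return x_data[mask], y_data[mask], x_data[~mask], y_data[~mask]

# Monotonically increasing x, to show that the order is preserved.
x = np.linspace(0.0, 9.0, 10)
y = x ** 2
x_train, y_train, x_test, y_test = split_with_mask(x, y, 0.7, seed=1)
```

The return order here follows the function in the question (x_train, y_train, x_test, y_test).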
