
Preprocessing training list with sklearn

I have an MNIST training list in the following form:

import gzip
import pickle as cPickle  # Python 3: cPickle was merged into pickle
import numpy as np

def load_data():
    f = gzip.open('mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f, encoding='latin1')
    f.close()
    return (training_data, validation_data, test_data)
def load_data_wrapper():
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = list(zip(training_inputs, training_results))
    ........................................

Now I would like to preprocess my training inputs to have zero mean and unit variance. So I used from sklearn import preprocessing in the following:

def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):
    if test_data: n_test = len(test_data)
    preprocessed_training = preprocessing.scale(training_data)  # raises the ValueError below
    n = len(preprocessed_training)
    for j in range(epochs):
        random.shuffle(preprocessed_training)
        mini_batches = [
            training_data[k:k+mini_batch_size].....
            ....................

However, I'm getting the following error:

ValueError: setting an array element with a sequence.

I'm modifying code from mnielsen that can be found here. I'm new to Python and machine learning in general. I would appreciate it if anyone could help me out. Note: if you think there is a better library option, please let me know as well.

Update_1: Here is another attempt, which gives the same error.

    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    scaler.fit(training_data)
    training_data = scaler.transform(training_data)
    if test_data: test_data = scaler.transform(test_data)

Update_2: I tried the solution from the suggested answer using a pandas DataFrame, but I am still getting the same error.

Update_3: So the array has object dtype, but I need float dtype to apply the scaler. I did the following: training_data = np.asarray(training_data).astype(np.float64) and I still get the error!
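For what it's worth, the error can be reproduced with toy data (only the shapes come from the question; the data itself is made up). Because training_data is a list of (input, label) tuples, np.asarray gives an object array; scaling the stacked inputs alone avoids that:

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical stand-in for the training set: a list of
# ((784, 1) float array, label) tuples. np.asarray() on such a list
# yields dtype=object, which is what makes scale() raise
# "setting an array element with a sequence".
rng = np.random.default_rng(0)
training_data = [(rng.random((784, 1)).astype(np.float32), i % 10)
                 for i in range(20)]

# Scale only the inputs, stacked into an (n_samples, 784) float matrix.
labels = [y for _, y in training_data]
inputs = np.array([x.ravel() for x, _ in training_data])
scaled = preprocessing.scale(inputs)  # zero mean, unit variance per feature

# Re-pair each scaled input with its label, restoring the (784, 1) shape.
training_data = [(row.reshape(784, 1), y) for row, y in zip(scaled, labels)]
```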

Update_4: General MNIST dataset structure: 50k training images and 10k test images. Each image is 28 * 28 pixels, which gives 784 data points. For example, a data point in MNIST whose original label is 5 is the tuple ( array([ 0., 0., 0., ..., 0., 0., 0.], dtype=float32), 5). The first element of the tuple is the input image, 784 greyscale floats that are mostly zeros. The second element of the tuple is the label, a number 0 through 9. With one-hot encoding, the label instead becomes a 10-dimensional vector where all entries are zero except at the index of the label value, so for the number 5 it is [[0],[0],[0],[0],[0],[1],[0],[0],[0],[0]]. The wrapper modification that I'm using can be found here.
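The one-hot encoding described above is what the vectorized_result helper referenced in the wrapper produces; a minimal sketch:

```python
import numpy as np

def vectorized_result(j):
    """Turn a digit label j (0-9) into a 10-dimensional one-hot
    column vector: all zeros except a 1.0 at index j."""
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e
```

For example, vectorized_result(5) is a (10, 1) vector with a 1.0 in position 5 and zeros elsewhere.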

I do this in a bit of a different way. Recall that you must scale your training and testing sets with the same transformation, fitted on your training data only. Also, you only want to scale your features, not the labels. I would start by converting to a train and a test DataFrame and a list of feature columns.

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(train[features])  # fit on the training features only
X_train = pd.DataFrame(scaler.transform(train[features]), columns=train[features].columns)
X_test = pd.DataFrame(scaler.transform(test[features]), columns=test[features].columns)

Does this work? Is there a reason you need to use batches?
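To illustrate the approach above with toy data (the column names and values here are made up, not from the question):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test frames with two feature columns and a label.
features = ['f0', 'f1']
train = pd.DataFrame({'f0': [1.0, 2.0, 3.0],
                      'f1': [10.0, 20.0, 30.0],
                      'label': [0, 1, 0]})
test = pd.DataFrame({'f0': [2.0], 'f1': [20.0], 'label': [1]})

scaler = StandardScaler()
scaler.fit(train[features])  # statistics come from the training set only
X_train = pd.DataFrame(scaler.transform(train[features]),
                       columns=train[features].columns)
X_test = pd.DataFrame(scaler.transform(test[features]),
                      columns=test[features].columns)
```

Because the scaler is fitted on train only, a test row equal to the training mean maps exactly to zero.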

The problem I was having is that from sklearn.preprocessing import StandardScaler changes the dimensions of my data. Instead of using StandardScaler, I just applied preprocessing.scale to each input in my (50k, (784, 1))-shaped dataset. That is, I applied the scale function to each (784, 1) input individually and collected the results in a for loop. This slowed down the program but worked. If anyone knows a better way, please let me know in the answer section.
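A sketch of that per-sample loop, with toy data standing in for the real training set (the axis choice is my reading of the post: axis=0 standardises each (784, 1) column on its own):

```python
import numpy as np
from sklearn import preprocessing

# Toy stand-in for the (50k, (784, 1)) training set described above.
rng = np.random.default_rng(0)
training_data = [(rng.random((784, 1)), i % 10) for i in range(5)]

# Standardise each input independently: every (784, 1) column ends up
# with zero mean and unit variance across its own 784 pixels.
scaled_training = []
for x, y in training_data:
    scaled_training.append((preprocessing.scale(x, axis=0), y))
```

Note this normalises each image relative to itself, which is not the same as StandardScaler's per-feature scaling across the whole training set.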


 