
Preprocessing training list with sklearn

I have an MNIST training list in the following form:

import gzip
import pickle as cPickle  # in Python 3 the old cPickle module lives in pickle
import numpy as np

def load_data():
    f = gzip.open('mnist.pkl.gz', 'rb')
    training_data, validation_data, test_data = cPickle.load(f, encoding='latin1')
    f.close()
    return (training_data, validation_data, test_data)

def load_data_wrapper():
    tr_d, va_d, te_d = load_data()
    training_inputs = [np.reshape(x, (784, 1)) for x in tr_d[0]]
    training_results = [vectorized_result(y) for y in tr_d[1]]
    training_data = list(zip(training_inputs, training_results))
    # ........................................
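load_data_wrapper calls vectorized_result, which isn't shown above. For completeness, the one-hot helper in the original mnist_loader looks roughly like this:

import numpy as np

def vectorized_result(j):
    # Return a (10, 1) one-hot column vector with 1.0 at position j,
    # turning a digit label 0-9 into a network output target.
    e = np.zeros((10, 1))
    e[j] = 1.0
    return e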

Now I would like to preprocess my training inputs to have zero mean and unit variance, so I used from sklearn import preprocessing in the following:

def SGD(self, training_data, epochs, mini_batch_size, eta,
        test_data=None):

    if test_data: n_test = len(test_data)
    preprocessed_training = preprocessing.scale(training_data)
    n = len(preprocessed_training)
    for j in range(epochs):
        random.shuffle(preprocessed_training)
        mini_batches = [
            training_data[k:k+mini_batch_size]
            # ....................

However, I'm getting the following error:

ValueError: setting an array element with a sequence.

I'm modifying code from mnielsen that can be found here. I'm new to Python and machine learning in general, so I would appreciate it if anyone could help me out. Note: if you think there is a better library option, please let me know as well.

Update_1: This was another attempt of mine, which gives the same error.

    scaler = StandardScaler()
    scaler.fit(training_data)
    training_data = scaler.transform(training_data)
    if test_data: test_data = scaler.transform(test_data)

Update_2: I tried the solution provided in the suggested answer using a pandas DataFrame, but I am still getting the same error.

Update_3: So it's object dtype, but I need float type to run the scaler. I did the following: training_data = np.asarray(training_data).astype(np.float64) and I still get the error!
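A minimal sketch of what is going on, assuming training_data is the list of (input, one-hot label) tuples built by load_data_wrapper (the arrays below are fake stand-ins): because the two elements of each tuple have different shapes, NumPy can only build an object array from the list, and .astype(np.float64) cannot turn that into numbers. The scaler only works once the 784-value inputs are pulled out and stacked into a plain 2-D float array:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Fake stand-ins for MNIST training tuples: (784, 1) input, (10, 1) one-hot label.
training_data = [(np.random.rand(784, 1).astype(np.float32), np.zeros((10, 1)))
                 for _ in range(2)]

# On recent NumPy, np.asarray(training_data) without dtype=object raises the
# same "setting an array element with a sequence" ValueError seen above.
arr = np.asarray(training_data, dtype=object)
print(arr.dtype)                                  # object

# Stack only the inputs into shape (n_samples, 784) before fitting the scaler.
X = np.hstack([x for x, y in training_data]).T    # (2, 784) float array
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.shape)                             # (2, 784)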

Update_4: General MNIST dataset structure: 50k training images and 10k test images. Each of the 50k images is 28 * 28 pixels, which gives 784 data points. For example, a data point in MNIST whose original output is 5 is the tuple ( array([ 0., 0., 0., ..., 0., 0., 0.], dtype=float32), 5). You can see that the first element of the tuple is a sparse matrix. Here is an example of the first element of a training tuple (i.e. the input image as 784 greyscale floats). For the second element of the tuple, we just give the output as a number 0 through 9. However, in one-hot encoding, we give a 10D vector where all index values are zero except for the index of the output value, so for the number 5 it will be [[0],[0],[0],[0],[0],[1],[0],[0],[0],[0]]. The wrapper modification that I'm using can be found here.

I do this in a bit of a different way. Recall that you must scale your training and test sets with the same function, built from all of your training data. Also, you only want to manipulate your features. I would start by converting to a train and test dataframe along with a list of features.

import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(train[features])
X_train = pd.DataFrame(scaler.transform(train[features]), columns=train[features].columns)
X_test = pd.DataFrame(scaler.transform(test[features]), columns=test[features].columns)
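Since the question's data is a list of (input, label) tuples rather than DataFrames, here is a hedged sketch of how train, test, and features might be built for this approach; to_frame and the pixel_* column names are hypothetical helpers, not part of the original code:

import numpy as np
import pandas as pd

def to_frame(data, features):
    # data: list of (input, label) tuples where each input is a (784, 1) array.
    # Stack the inputs into rows so each pixel becomes one feature column.
    return pd.DataFrame(np.hstack([x for x, _ in data]).T, columns=features)

features = ['pixel_%d' % i for i in range(784)]

# train = to_frame(training_data, features)   # training_data from load_data_wrapper
# test = to_frame(test_data, features)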

Does this work? Is there a reason you need to use batches?

The problem I was having is because from sklearn.preprocessing import StandardScaler changes the dimensions of my data. Instead of using StandardScaler, I just used preprocessing.scale on each input in my (50k, (784, 1)) dim dataset. That is, I applied the scale function to each (784, 1) data point on axis = 1 and collected them with a for loop. This slowed down the program but worked. If anyone knows a better way, please let me know in the answer section.
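A minimal sketch of that per-input workaround, using fake stand-in data; it assumes each (784, 1) image is standardized across its own 784 values (preprocessing.scale's default axis), and the exact axis handling in the original code may differ:

import numpy as np
from sklearn import preprocessing

# Fake stand-ins; in the real code this is the 50k-element training list.
training_data = [(np.random.rand(784, 1), np.zeros((10, 1))) for _ in range(3)]

scaled_training = []
for x, y in training_data:
    # Scale each image on its own so the (input, label) tuple structure is kept.
    scaled_training.append((preprocessing.scale(x), y))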
