
Scikit-learn: Predicting new raw and unscaled instance using models trained with scaled data

I have produced different classifier models using scikit-learn, and this has been smooth sailing. Because the features come in different units (the data comes from different sensors, labeled with their corresponding categories), I opted to scale them using the StandardScaler class.

The resulting accuracy scores of the different classifiers were fine. However, when I try to use a model to predict a raw (that is, unscaled) instance of sensor values, the models output the wrong classification.

Should this really be the case because of the scaling done to the training data? If so, is there an easy way to scale the raw values too? I would like to use joblib for model persistence here, and it would be appreciated if there is a way to make this as modular as possible. That is, I would rather not record the mean and standard deviation of each feature by hand every time the training data changes.

Should this really be the case because of the scaling done to the training data?

Yes, this is expected behavior. You trained your model on scaled data, thus it will only work with scaled data.

If so, is there an easy way to scale the raw values too?

Yes, just save your scaler.

# Training
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
...
# do some training, probably save the classifier, and save the scaler too!

then

# Testing
# load scaler
scaled_instances = scaler.transform(raw_instances)

That is, I would rather not record the mean and standard deviation of each feature by hand every time the training data changes

This is exactly what has to happen, although not by hand (that is what the scaler computes for you). Essentially, "under the hood" the fitted scaler stores the mean and standard deviation of each feature, and that is what you persist.
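Since the question mentions joblib, a minimal persistence sketch could look like the code below. This is only an illustration of the point above, not code from the question: SVC is just a stand-in classifier, the dummy data is made up, and the file names are arbitrary.

import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Dummy two-feature sensor data standing in for the real training set.
X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 250.0], [4.0, 400.0]])
y_train = np.array([0, 0, 1, 1])

# Training side: fit the scaler, scale the data, fit the classifier.
scaler = StandardScaler()
clf = SVC()
clf.fit(scaler.fit_transform(X_train), y_train)

# Persist BOTH fitted objects, not just the classifier.
joblib.dump(clf, "classifier.joblib")
joblib.dump(scaler, "scaler.joblib")

# Prediction side: load both, scale the raw instance, then predict.
clf = joblib.load("classifier.joblib")
scaler = joblib.load("scaler.joblib")
raw_instance = np.array([[2.5, 320.0]])
print(clf.predict(scaler.transform(raw_instance)))

If you want this even more modular, scikit-learn's Pipeline (for example, make_pipeline(StandardScaler(), SVC())) bundles scaling and classification into a single object, so only one file needs to be dumped and loaded, and predict() on the pipeline accepts raw values directly.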

I had been struggling with this problem for days and googling a lot, and finally, thanks to lejlot's answer, I solved exactly the problem you mention.

I was frustrated that nobody spelled out how to predict for an arbitrary number after standardizing X. (By the way, you should not standardize y; I was confused at first because everybody else seemed confused and wrote it wrongly.)

Below is code you can easily refer to.

from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
import numpy as np


X = np.array([[0], [1], [2], [3], [4], [5], [6], [7]])
y = 2 * X  # the target is simply twice the input

scaler = StandardScaler()
X_train = scaler.fit_transform(X)
print(X_train)


# All other MLPRegressor parameters are left at their scikit-learn defaults.
model = MLPRegressor(hidden_layer_sizes=(3,), activation='logistic',
                     solver='lbfgs', max_iter=2000, verbose=True)

# Note that only X is standardized here, not y.
model.fit(X_train, y)


# Testing
# Reuse the fitted scaler on new raw values before predicting.
scaled_instances = scaler.transform(np.array([[1], [2]]))
print(scaled_instances)

s = model.predict(scaled_instances)
print(s)

I tested several numbers and it returned the correct values. A very helpful piece of information from lejlot's answer was that the scaler fitted during training keeps what it learned; I had absolutely no idea about that.

Thanks to this, whatever number we want to predict for, the fitted scaler transforms that input the same way it transformed the training data before it reaches the model.
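To make "keeps what it learned" concrete: a fitted StandardScaler exposes its learned statistics through the mean_ and scale_ attributes (these are part of scikit-learn's documented API), and transform() simply applies (x - mean_) / scale_. Continuing from the example above:

print(scaler.mean_)   # [3.5]         - the mean of 0..7
print(scaler.scale_)  # [2.29128785]  - the (population) standard deviation of 0..7

# transform() applies (x - mean_) / scale_ to any new input:
print((np.array([[1], [2]]) - scaler.mean_) / scaler.scale_)
# identical to scaler.transform(np.array([[1], [2]])) above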
