I want to apply the sklearn.preprocessing.scale function that scikit-learn offers to center a dataset that I will use to train an SVM classifier.
How can I then store the standardization parameters so that I can also apply them to the data that I want to classify?
I know I can use StandardScaler, but can I somehow serialize it to a file so that I won't have to fit it to my data every time I want to run the classifier?
I think that the best way is to pickle it after fitting, as this is the most generic option. Perhaps you'll later create a pipeline composed of both a feature extractor and a scaler. By pickling a (possibly compound) stage, you keep the persistence code the same no matter what the stage contains. The sklearn documentation on model persistence discusses how to do this.
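A minimal sketch of pickling a compound stage, assuming a pipeline made of a StandardScaler and an SVC. The toy data and the file name svm_pipeline.pkl are made up for illustration and are not part of the original answer:

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy training data (hypothetical): 6 samples, 2 features, 2 classes.
X_train = np.array([[0.0, 10.0], [1.0, 12.0], [2.0, 11.0],
                    [8.0, 30.0], [9.0, 32.0], [10.0, 31.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# Fit the scaler and the classifier together as one compound stage.
model = make_pipeline(StandardScaler(), SVC())
model.fit(X_train, y_train)

# Serialize the fitted pipeline once (a temp directory is used here).
path = os.path.join(tempfile.mkdtemp(), "svm_pipeline.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Later: reload it and classify new data without refitting anything.
with open(path, "rb") as f:
    restored = pickle.load(f)

X_new = np.array([[1.5, 11.0], [9.5, 31.0]])
predictions = restored.predict(X_new)
```

The restored pipeline applies the stored mean_ and scale_ before calling the SVM, so the scaling parameters never have to be handled separately. Only unpickle files you trust, and load them with the same scikit-learn version that wrote them.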
Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters:
scale_ : ndarray, shape (n_features,). Per-feature relative scaling of the data. New in version 0.17: scale_ is recommended instead of the deprecated std_.
mean_ : array of floats with shape [n_features]. The mean value for each feature in the training set.
The following short snippet illustrates this:
from sklearn import preprocessing
import numpy as np
s = preprocessing.StandardScaler()
s.fit(np.array([[1., 2, 3, 4]]).T)
>>> s.mean_, s.scale_
(array([ 2.5]), array([ 1.11803399]))
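As a cross-check of those two numbers, here is a small NumPy-only sketch. It relies on StandardScaler using the population (biased) standard deviation, i.e. ddof=0; the variable names are illustrative:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

# StandardScaler stores the per-feature mean ...
mean = x.mean()

# ... and the population standard deviation (ddof=0), not the sample one.
scale = x.std(ddof=0)

# Standardizing by hand with these two parameters.
standardized = (x - mean) / scale
```

Here mean is 2.5 and scale is sqrt(1.25), which matches the mean_ and scale_ printed above.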
Scale with standard scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)
save mean_ and var_ for later use
means = scaler.mean_
vars = scaler.var_
(You can print and copy-paste means and vars, or save them to disk with np.save.)
Later use of saved parameters
# The defaults are bound once, when the function is defined.
def scale_data(array, means=means, stds=vars ** 0.5):
    return (array - means) / stds

scale_new_data = scale_data(new_data)
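If you would rather keep the parameters in a file than copy-paste them, the following is one possible sketch. It assumes np.savez as the storage format and uses scale_ instead of var_ ** 0.5; the toy data and the file name scaler_params.npz are hypothetical:

```python
import os
import tempfile

import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy training data (hypothetical): 4 samples, 2 features.
data = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [4.0, 500.0]])
scaler = StandardScaler().fit(data)

# Store both parameter arrays in a single .npz file.
path = os.path.join(tempfile.mkdtemp(), "scaler_params.npz")
np.savez(path, mean=scaler.mean_, scale=scaler.scale_)

# Later: load the plain arrays back (no pickle involved) and scale new data.
params = np.load(path)
mean, scale = params["mean"], params["scale"]

new_data = np.array([[2.5, 350.0], [5.0, 100.0]])
scaled_new = (new_data - mean) / scale
```

Using scale_ rather than the square root of var_ also covers constant features: StandardScaler sets scale_ to 1 for them, which avoids a division by zero.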
Pickling is usually a bad idea, at least in production, because loading a pickle can execute arbitrary code (see https://github.com/numpy/numpy/blob/b88b2c0c19851810d4ee07f03a7734b6e19dbdaa/numpy/lib/npyio.py#L472 ), so I am using another approach:
# scaler is fitted instance of MinMaxScaler
scaler_data_ = np.array([scaler.data_min_, scaler.data_max_])
# the array must come before the keyword argument
np.save("my_scaler.npy", scaler_data_, allow_pickle=False)
# some unscaled X
Xreal = np.array([1.9261148646249848, 0.7327923702472628, 118, 1083])
scaler_data_ = np.load("my_scaler.npy")
Xmin, Xmax = scaler_data_[0], scaler_data_[1]
Xscaled = (Xreal - Xmin) / (Xmax-Xmin)
Xscaled
# -> array([0.63062502, 0.35320565, 0.15144766, 0.69116555])
You can use the joblib module to store the parameters of your scaler.
from joblib import dump
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
dump(scaler, 'scaler_filename.joblib')
Later you can load the scaler.
from joblib import load
scaler = load('scaler_filename.joblib')
transformed_data = scaler.transform(new_data)