
Predicting new data using sklearn after standardizing the training data

I am using Sklearn to build a linear regression model (or any other model) with the following steps:

X_train and Y_train are the training data

  1. Standardize the training data

     X_train = preprocessing.scale(X_train)
  2. fit the model

     model.fit(X_train, Y_train)

Once the model is fit with scaled data, how can I predict with new data (either one or more data points at a time) using the fit model?

What I am using is

  1. Scale the data

    NewData_Scaled = preprocessing.scale(NewData)
  2. Predict the data

    PredictedTarget = model.predict(NewData_Scaled)

I think I am missing a transformation step: preprocessing.scale does not remember the scaling parameters, so I cannot save them with the trained model and apply them to new, unseen data. Any help, please?

Take a look at these docs.

You can use the StandardScaler class of the preprocessing module to remember the scaling of your training data so you can apply it to future values.

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = StandardScaler().fit(X_train)

scaler has calculated the mean and scaling factor to standardize each feature.

>>> scaler.mean_
array([ 1. ...,  0. ...,  0.33...])
>>> scaler.scale_
array([ 0.81...,  0.81...,  1.24...])
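These stored statistics can be verified by hand: StandardScaler uses the per-column mean and the population standard deviation (ddof=0). A small check, using the same toy matrix as above:

```python
import numpy as np

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])

# Per-feature mean and population standard deviation (ddof=0),
# which is exactly what scaler.mean_ and scaler.scale_ hold.
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)
```
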

To apply it to a dataset:

import numpy as np

X_train_scaled = scaler.transform(X_train)
new_data = np.array([[-1.,  1., 0.]])    # must be 2D: one row per sample
new_data_scaled = scaler.transform(new_data)
>>> new_data_scaled
array([[-2.44...,  1.22..., -0.26...]])
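To keep the fitted scaler together with the trained model, as the question asks, one option is to pickle both objects side by side. A minimal sketch (the file name `model_bundle.pkl` and the toy data are illustrative):

```python
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X_train = np.array([[1., -1., 2.],
                    [2., 0., 0.],
                    [0., 1., -1.]])
y_train = np.array([1., 2., 3.])

scaler = StandardScaler().fit(X_train)
model = LinearRegression().fit(scaler.transform(X_train), y_train)

# Persist the scaler together with the model so both can be
# restored later for inference.
with open('model_bundle.pkl', 'wb') as f:
    pickle.dump({'scaler': scaler, 'model': model}, f)

# ... later, possibly in a different process ...
with open('model_bundle.pkl', 'rb') as f:
    bundle = pickle.load(f)

new_data = np.array([[-1., 1., 0.]])
prediction = bundle['model'].predict(bundle['scaler'].transform(new_data))
```

The scikit-learn docs suggest joblib for persisting fitted estimators, but the standard-library pickle used here works the same way for small objects.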

The above answer works when the training data and test data are handled in a single run. But what if you want to test or run inference after training, in a separate session?

The following approach saves the scaling parameters to disk so they can be reloaded later:

from sklearn.preprocessing import StandardScaler
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data 

sc = StandardScaler()
sc.fit(X)
x = sc.transform(X)
# New data: a single sample, but it must still have all four features
sc.transform(np.array([[6.5, 1.5, 2.5, 6.5]]))  # keep this output to compare against later



# Save the mean and standard deviation so they can be reloaded for inference
std = np.sqrt(sc.var_)
np.save('std.npy', std)
np.save('mean.npy', sc.mean_)

This block is independent and can run in a separate session:

s = np.load('std.npy')
m = np.load('mean.npy')
(np.array([[6.5, 1.5, 2.5, 6.5]]) - m) / s   # z = (x - u) / s ---> the standardization formula
# will produce the same output as the transform above
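An alternative that avoids saving the statistics by hand is scikit-learn's Pipeline, which bundles the scaler and the estimator into a single object so that predict() re-applies the training-time scaling automatically. A sketch, using a linear regression and toy data for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
y = np.array([1., 2., 3.])

# fit() standardizes X and then fits the regressor on the scaled
# features; predict() applies the same stored scaling to new data.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)
pred = pipe.predict(np.array([[-1., 1., 0.]]))
```

The whole pipeline can then be pickled as one object, so the scaler can never get out of sync with the model.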
