
How to store scaling parameters for later use

I want to apply the scaling that the sklearn.preprocessing.scale module from scikit-learn offers to center a dataset that I will use to train an SVM classifier.

How can I then store the standardization parameters so that I can also apply them to the data that I want to classify?

I know I can use the StandardScaler, but can I somehow serialize it to a file so that I won't have to fit it to my data every time I want to run the classifier?

I think that the best way is to pickle it post-fit, as this is the most generic option. Perhaps you'll later create a pipeline composed of both a feature extractor and a scaler. By pickling a (possibly compound) stage, you're making things more generic. The sklearn documentation on model persistence discusses how to do this.
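
For example, a minimal sketch of pickling a fitted scaler with the standard library pickle module (X_train, X_new and the file name scaler.pkl are just placeholders for your own data and path):

import pickle
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)  # X_train is your training data

# persist the fitted scaler to disk
with open('scaler.pkl', 'wb') as f:
    pickle.dump(scaler, f)

# later, in the classification script, restore it and reuse the same parameters
with open('scaler.pkl', 'rb') as f:
    scaler = pickle.load(f)
X_new_scaled = scaler.transform(X_new)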

Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters:

scale_ : ndarray, shape (n_features,)
    Per feature relative scaling of the data.
    New in version 0.17: scale_ is recommended instead of the deprecated std_.

mean_ : array of floats with shape [n_features]
    The mean value for each feature in the training set.

The following short snippet illustrates this:

from sklearn import preprocessing
import numpy as np

# fit the scaler on a single feature (one column of four samples)
s = preprocessing.StandardScaler()
s.fit(np.array([[1., 2, 3, 4]]).T)

print((s.mean_, s.scale_))
# (array([ 2.5]), array([ 1.11803399]))

Scale with StandardScaler

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)

Save mean_ and var_ for later use

means = scaler.mean_ 
vars = scaler.var_    

(You can print and copy-paste the means and vars, or save them to disk with np.save; a sketch of the np.save route follows the snippet below.)

Later use of saved parameters

def scale_data(array, means=means, stds=vars ** 0.5):
    return (array - means) / stds

scale_new_data = scale_data(new_data)
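
A minimal sketch of the np.save route mentioned above (the file name scaler_params.npy is just an illustration):

import numpy as np

# save both arrays in one file (means in row 0, vars in row 1)
np.save('scaler_params.npy', np.array([means, vars]))

# later: reload and rebuild the standard deviations
params = np.load('scaler_params.npy')
means, stds = params[0], params[1] ** 0.5
scaled_new_data = scale_data(new_data, means=means, stds=stds)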

Pickling is usually a bad idea, at least in production ( https://github.com/numpy/numpy/blob/b88b2c0c19851810d4ee07f03a7734b6e19dbdaa/numpy/lib/npyio.py#L472 ), so I am using another approach:

# scaler is a fitted instance of MinMaxScaler
scaler_data_ = np.array([scaler.data_min_, scaler.data_max_])
np.save("my_scaler.npy", scaler_data_, allow_pickle=False)

# some not-yet-scaled X
Xreal = np.array([1.9261148646249848, 0.7327923702472628, 118, 1083])

scaler_data_ = np.load("my_scaler.npy")
Xmin, Xmax = scaler_data_[0], scaler_data_[1]
Xscaled = (Xreal - Xmin) / (Xmax-Xmin)
Xscaled
# -> array([0.63062502, 0.35320565, 0.15144766, 0.69116555])

You can use the joblib module to store the parameters of your scaler.

from joblib import dump
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
dump(scaler, 'scaler_filename.joblib')

Later you can load the scaler.

from joblib import load
scaler = load('scaler_filename.joblib')
transformed_data = scaler.transform(new_data)
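
The same approach also works for a whole Pipeline, so the scaler and the SVM classifier from the question can be persisted together. A minimal sketch, assuming X_train, y_train, new_data and the file name svm_pipeline.joblib as placeholders:

from joblib import dump, load
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# fit scaler and classifier as one object
pipeline = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
pipeline.fit(X_train, y_train)
dump(pipeline, 'svm_pipeline.joblib')

# later: load and predict on new, unscaled data
pipeline = load('svm_pipeline.joblib')
predictions = pipeline.predict(new_data)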
