
The necessity of feature scaling before fitting a classifier in scikit-learn

I used to believe that scikit-learn's Logistic Regression classifier (as well as SVM) automatically standardizes my data before training. The reason was the regularization parameter C that is passed to the LogisticRegression constructor: as I understand it, applying regularization doesn't make sense without feature scaling. For regularization to work properly, all the features should be on comparable scales. Therefore, I assumed that when calling LogisticRegression.fit(X, y) on training data X, the fit method first performs feature scaling and then starts training. To test my assumption, I decided to manually scale the features of X as follows:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_std = scaler.transform(X)

Then I initialized a LogisticRegression object with a regularization parameter C:

from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression(C=10.0, random_state=0)

I found that training the model on X is not equivalent to training the model on X_std. That is to say, the model produced by

log_reg.fit(X_std, y)

is not similar to the model produced by

log_reg.fit(X, y)

Does that mean that scikit-learn doesn't standardize the features before training? Or maybe it does scale, but by applying a different procedure? If scikit-learn doesn't perform feature scaling, how is that consistent with requiring the regularization parameter C? Should I manually standardize my data every time before fitting the model in order for regularization to make sense?
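The experiment described in the question can be reproduced with a small sketch like the one below. The synthetic dataset from make_classification, the per-feature scale factors, and the raised max_iter are assumptions made for illustration; they are not the asker's data or settings. If fit standardized the features internally, the two sets of coefficients would match.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the asker's data).
X, y = make_classification(
    n_samples=200, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
# Put the four features on very different scales.
X = X * np.array([1.0, 10.0, 100.0, 1000.0])

# Manual standardization, as in the question.
X_std = StandardScaler().fit_transform(X)

# Same hyperparameters for both fits; max_iter is raised (an assumption)
# because the unscaled problem may converge slowly.
model_raw = LogisticRegression(C=10.0, random_state=0, max_iter=10000).fit(X, y)
model_std = LogisticRegression(C=10.0, random_state=0, max_iter=10000).fit(X_std, y)

print("coefficients on X:    ", model_raw.coef_)
print("coefficients on X_std:", model_std.coef_)

# True when the two fits produced different coefficients.
coefs_differ = not np.allclose(model_raw.coef_, model_std.coef_)
print("coefficients differ:", coefs_differ)
```

Because the penalty acts on the coefficients themselves, a feature measured in large units gets a small coefficient and is penalized less than the same feature measured in small units, which is one reason the two fits need not agree.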

From the following note in http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html, I'd assume that you need to preprocess the data yourself (e.g. with a scaler from sklearn.preprocessing):

solver : {'newton-cg', 'lbfgs', 'liblinear', 'sag'}

Algorithm to use in the optimization problem. For small datasets, 'liblinear' is a good choice, whereas 'sag' is faster for large ones.

For multiclass problems, only 'newton-cg' and 'lbfgs' handle multinomial loss; 'sag' and 'liblinear' are limited to one-versus-rest schemes.

'newton-cg', 'lbfgs' and 'sag' only handle L2 penalty.

Note that 'sag' fast convergence is only guaranteed on features with approximately the same scale. You can preprocess the data with a scaler from sklearn.preprocessing.
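One way to avoid standardizing by hand before every fit is to chain the scaler and the classifier in a Pipeline. The sketch below is only an illustration under assumptions: the synthetic data, the train/test split, and the variable names are not from the question. The pipeline fits StandardScaler on the training data only and reuses the learned mean and variance when scoring the test data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with features on very different scales
# (an assumption for illustration).
X, y = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
X = X * np.array([1.0, 10.0, 100.0, 1000.0])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# The scaler is fitted inside pipe.fit, on the training data only.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(C=10.0, random_state=0),
)
pipe.fit(X_train, y_train)

# score() applies the already-fitted scaler to X_test before predicting.
test_accuracy = pipe.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

The same pipeline object can be passed to utilities such as cross_val_score, so the scaling is refitted on each training fold rather than on the full dataset.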
