Reproducing LASSO / logistic regression results from R in Python using the Iris dataset

Question

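I am trying to reproduce the following R results in Python. In this particular case the predictive skill in R is lower than in Python, but in my experience that is usually not so (hence my wish to reproduce the R results in Python), so please ignore that detail here.

The aim is to predict the flower species ('versicolor' 0 or 'virginica' 1). We have 100 labelled samples, each consisting of 4 flower features: sepal length, sepal width, petal length, petal width. I split the data into a training set (60% of the data) and a test set (40% of the data). 10-fold cross-validation is applied to the training set to search for the optimal lambda (in scikit-learn the parameter being optimized is "C").

In R I am using glmnet with alpha set to 1 (for the LASSO penalty), and in Python scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with the L1 penalty). The scoring metric used in cross-validation is the same in both languages. Somehow, though, the model results differ: the intercept and the coefficient of each feature vary quite a bit.

R code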

library(glmnet)
library(datasets)
data(iris)

y <- as.numeric(iris[,5])
X <- iris[y!=1, 1:4]
y <- y[y!=1]-2

n_sample = NROW(X)

w = .6
X_train = X[1:(w * n_sample),]  # (60, 4)
y_train = y[1:(w * n_sample)]   # (60,)
X_test = X[((w * n_sample)+1):n_sample,]  # (40, 4)
y_test = y[((w * n_sample)+1):n_sample]   # (40,)

# set alpha=1 for LASSO and alpha=0 for ridge regression
# use class for logistic regression
set.seed(0)
model_lambda <- cv.glmnet(as.matrix(X_train), as.factor(y_train),
                        nfolds = 10, alpha=1, family="binomial", type.measure="class")

best_s  <- model_lambda$lambda.1se
pred <- as.numeric(predict(model_lambda, newx=as.matrix(X_test), type="class" , s=best_s))

# best lambda
print(best_s)
# 0.04136537

# fraction correct
print(sum(y_test==pred)/NROW(pred))   
# 0.75

# model coefficients
print(coef(model_lambda, s=best_s))
#(Intercept)  -14.680479
#Sepal.Length   0        
#Sepal.Width   0
#Petal.Length   1.181747
#Petal.Width    4.592025

Python code

from sklearn import datasets
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0]  # four features. Disregard one of the 3 species.
y = y[y != 0]-1  # two species: 'versicolor' (0), 'virginica' (1). Disregard one of the 3 species.

n_sample = len(X)

w = .6
X_train = X[:int(w * n_sample)]  # (60, 4)
y_train = y[:int(w * n_sample)]  # (60,)
X_test = X[int(w * n_sample):]  # (40, 4)
y_test = y[int(w * n_sample):]  # (40,)

X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)

clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring='accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)

print(clf.score(X_train_fit.transform(X_test), y_test))  # score is 0.775
print(clf.intercept_)  # -1.83569557
print(clf.coef_)  # [ 0,  0, 0.65930981, 1.17808155] (sepal length, sepal width, petal length, petal width)
print(clf.C_)  # optimal C (inverse of the regularization strength): 0.35938137

Answer 1

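There are a few things that differ between the examples above:

1. Scale of the coefficients
- glmnet ( https://cran.r-project.org/web/packages/glmnet/glmnet.pdf ) standardizes the data, and "the coefficients are always returned on the original scale". This is why you did not need to scale your data before calling glmnet.
- The Python code standardizes the data and then fits to the standardized data. The coefficients in this case are on the standardized scale, not the original scale, so they cannot be compared directly between the two examples.
2. Cross-validation folds. LogisticRegressionCV uses stratified folds by default. glmnet uses plain k-fold.
3. They fit different equations. Note that scikit-learn's logistic regression ( http://scikit-learn.org/stable/modules/linear_model.html#logistic-regression ) puts the regularization weight C on the logistic loss term, whereas glmnet puts the weight lambda on the penalty term.
4. Choice of regularization strengths to try. glmnet defaults to 100 lambdas. scikit-learn's LogisticRegressionCV defaults to 10 values of C. Because of the equation scikit-learn solves, these range between 1e-4 and 1e4 ( http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegressionCV.html#sklearn.linear_model.LogisticRegressionCV ).
5. Tolerance is different. In some problems I have had, tightening the tolerance changed the coefficients significantly.
- glmnet defaults thresh to 1e-7
- LogisticRegressionCV defaults tol to 1e-4
- Even after making them the same, they may not measure the same thing. I do not know what liblinear measures. For glmnet: "Each inner coordinate-descent loop continues until the maximum change in the objective after any coefficient update is less than thresh times the null deviance."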
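
The first and third points can be made concrete with a little arithmetic. The sketch below is an illustration, not part of the original answer. It assumes that scikit-learn's L1 objective is `||w||_1 + C * sum(log-loss)` and that glmnet's is `mean(log-loss) + lambda * ||beta||_1`, which gives lambda of roughly `1 / (n * C)` for `n` training samples. It also maps coefficients fitted on standardized features back to the original scale. The helper names `c_to_lambda` and `to_original_scale` are made up for this example, and the feature means and standard deviations are placeholders. The correspondence is only approximate, since liblinear also penalizes the intercept while glmnet does not.

```python
import numpy as np


def c_to_lambda(C, n_samples):
    """Approximate glmnet lambda for a scikit-learn C (under the assumed objectives)."""
    return 1.0 / (n_samples * C)


def to_original_scale(coef_std, intercept_std, mean, scale):
    """Map coefficients fitted on (x - mean) / scale back to raw-feature units."""
    coef_std = np.asarray(coef_std, dtype=float)
    coef = coef_std / scale
    intercept = intercept_std - np.sum(coef_std * mean / scale)
    return coef, intercept


# Values reported in the question: C_ = 0.35938137 with 60 training samples.
approx_lambda = c_to_lambda(0.35938137, 60)
print(approx_lambda)

# Standardized coefficients and intercept reported in the question.
coef_std = np.array([0.0, 0.0, 0.65930981, 1.17808155])
intercept_std = -1.83569557

# Hypothetical feature means and standard deviations (NOT taken from the question);
# in practice use scaler.mean_ and scaler.scale_ from the fitted StandardScaler.
mean = np.array([6.0, 2.8, 4.5, 1.4])
scale = np.array([0.6, 0.3, 0.6, 0.3])

coef, intercept = to_original_scale(coef_std, intercept_std, mean, scale)
print(coef, intercept)
```

With the question's numbers the converted lambda comes out at about 0.046, the same order of magnitude as the 0.0414 reported by glmnet, although the two values were selected by different rules (lambda.1se in R, best mean cross-validation score in scikit-learn).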

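You might want to try printing the regularization paths to see whether they are very similar and merely stop at a different strength. Then you can research why.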
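
One way to print such a path on the scikit-learn side is sketched below. It is an assumption-laden illustration rather than the answerer's code: it refits a plain `LogisticRegression` for each value on the default `LogisticRegressionCV` grid, uses all 100 versicolor/virginica samples, and standardizes them first.

```python
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
mask = iris.target != 0
X = StandardScaler().fit_transform(iris.data[mask])
y = iris.target[mask] - 1

# Same default grid as LogisticRegressionCV: 10 values between 1e-4 and 1e4.
Cs = np.logspace(-4, 4, 10)

path = []
for C in Cs:
    model = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    model.fit(X, y)
    path.append(model.coef_.ravel())
path = np.array(path)  # shape (len(Cs), n_features)

# Small C means strong regularization, so coefficients enter the model as C grows.
for C, coefs in zip(Cs, path):
    print(f"C={C:10.4g}  coef={np.round(coefs, 3)}")
```

Comparing the order in which features become non-zero here with the order along the glmnet path is usually more informative than comparing a single selected model.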

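Even after changing what you can change, which is not everything listed above, you may not get the same coefficients or results. Although you are solving the same problem in different software, how the software solves the problem can differ. We see different scales, different equations, different defaults, different solvers, and so on.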

Answer 2

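The problem you have here is the ordering of the dataset (note that I have not checked the R code, but I am certain this is the problem). If I run your code and then run this

print(np.bincount(y_train))  # [50 10]
print(np.bincount(y_test))  # [ 0 40]

you can see that the training set is not representative of the test set. However, if I make a couple of changes to your Python code, I get a test accuracy of 0.9.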

from sklearn import datasets
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
import numpy as np

iris = datasets.load_iris()
X = iris.data
y = iris.target
X = X[y != 0]  # four features. Disregard one of the 3 species.
y = y[y != 0]-1  # two species: 'versicolor' (0), 'virginica' (1). Disregard one of the 3 species.

X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, 
                                                                    test_size=0.4,
                                                                    random_state=42,
                                                                    stratify=y)


X_train_fit = StandardScaler().fit(X_train)
X_train_transformed = X_train_fit.transform(X_train)

clf = LogisticRegressionCV(n_jobs=2, penalty='l1', solver='liblinear', cv=10, scoring = 'accuracy', random_state=0)
clf.fit(X_train_transformed, y_train)

print(clf.score(X_train_fit.transform(X_test), y_test))  # score is 0.9
print(clf.intercept_)  # 0.
print(clf.coef_)  # [ 0., 0. ,0., 0.30066888] (sepal length, sepal width, petal length, petal width)
print(clf.C_)  # [ 0.04641589]

Answer 3

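I have to take issue with a few things here.

Firstly, "for Python, scikit-learn's LogisticRegressionCV function with the "liblinear" solver (the only solver that can be used with the L1 penalty)". That is patently false, unless you meant to qualify it in some more definite way. Just take a look at the descriptions of the classes in sklearn.linear_model and you will see a handful that specifically mention L1. I am sure others allow you to implement it as well, but I don't really feel like counting.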
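
As one concrete illustration (not from the original answer), recent scikit-learn releases also accept an L1 penalty with the `saga` solver of `LogisticRegression`. The sketch assumes a scikit-learn version that ships `saga`, which may be newer than the one used in the question. The values of `C` and `max_iter` are arbitrary.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
mask = iris.target != 0
X = StandardScaler().fit_transform(iris.data[mask])
y = iris.target[mask] - 1

# L1-penalized logistic regression with a solver other than liblinear.
# C=1.0 and max_iter=5000 are arbitrary choices for this example.
clf_saga = LogisticRegression(
    penalty="l1", solver="saga", C=1.0, max_iter=5000, random_state=0
)
clf_saga.fit(X, y)

print(clf_saga.coef_)
print(clf_saga.score(X, y))  # accuracy on the training data
```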

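Secondly, your method of splitting the data is less than ideal. Look at your input and output after the split and you will find that all of the test samples have a target value of 1, while target 1 accounts for only 1/6 of your training samples. This imbalance is not representative of the distribution of the targets and causes your model to be poorly fit. For example, just using sklearn.model_selection.train_test_split out of the box, and then refitting the LogisticRegressionCV classifier exactly as you had it, results in an accuracy of .92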
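
The imbalance is easy to check directly. The following sketch compares the class counts of the question's sequential split with those of a shuffled split. Note that it adds `stratify=y` and `random_state=0`, which go beyond the "out of the box" call mentioned above and are choices made only for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
mask = iris.target != 0
X = iris.data[mask]
y = iris.target[mask] - 1  # 0 = versicolor, 1 = virginica, sorted by class

# Sequential 60/40 split, as in the question.
n_train = int(0.6 * len(X))
seq_train_counts = np.bincount(y[:n_train], minlength=2)
seq_test_counts = np.bincount(y[n_train:], minlength=2)
print(seq_train_counts, seq_test_counts)

# Shuffled and stratified split; random_state=0 is an arbitrary choice.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)
strat_train_counts = np.bincount(y_train, minlength=2)
strat_test_counts = np.bincount(y_test, minlength=2)
print(strat_train_counts, strat_test_counts)
```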

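Now, all that being said, there is a glmnet package for Python, and you can replicate your results using it. The authors of that project wrote a blog post discussing some of the limitations of trying to recreate glmnet results with sklearn. Specifically:

"Scikit-Learn has a few solvers that are similar to glmnet, ElasticNetCV and LogisticRegressionCV, but they have some limitations. The first one only works for linear regression and the latter does not handle the elastic net penalty." - Bill Lattner, GLMNET FOR PYTHON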

3 answers

Answer 1: score 4, accepted, 2017-04-24 21:00:19
Answer 2: score 1, 2017-04-24 12:55:37
Answer 3: score 1, 2017-04-24 12:57:44