Different results when computing linear regression with scipy.stats and statsmodels

Question

I'm getting different values of r^2 (coefficient of determination) when I try OLS fits with these two libraries and I can't quite figure out why. (Some spacing removed for your convenience)

In [1]: import pandas as pd       
In [2]: import numpy as np
In [3]: import statsmodels.api as sm
In [4]: import scipy.stats
In [5]: np.random.seed(100)
In [6]: x = np.linspace(0, 10, 100) + 5*np.random.randn(100)
In [7]: y = np.arange(100)

In [8]: slope, intercept, r, p, std_err = scipy.stats.linregress(x, y)

In [9]: r**2
Out[9]: 0.22045988449873671

In [10]: model = sm.OLS(y, x)
In [11]: est = model.fit()

In [12]: est.rsquared
Out[12]: 0.5327910685035413

What is going on here? I can't figure it out! Is there an error somewhere?

Answer 1

This is not an answer to the original question, which has already been answered.

About R-squared in a regression without a constant.

One problem is that a regression without an intercept doesn't have the standard definition of R^2.

Essentially, R-squared as a goodness-of-fit measure in a model with an intercept compares the full model with the model that has only an intercept. If the full model does not have an intercept, then the standard definition of R^2 can produce weird results like negative R^2.

The conventional definition in the regression without a constant divides by the total sum of squares of the dependent variable instead of the demeaned sum of squares. The R^2 of a regression with a constant and the R^2 of one without cannot really be compared in a meaningful way.
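To make the two definitions concrete, here is a minimal sketch that uses only NumPy. The data, the seed and the variable names are made up for illustration and are not from the question. It fits y = b*x by least squares and computes R^2 both ways: dividing by the demeaned (centered) sum of squares and dividing by the raw (uncentered) sum of squares.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, illustration only
x = rng.normal(size=200)
# Hypothetical data with a large offset that a no-intercept fit cannot capture
y = 50.0 + x + rng.normal(size=200)

# Least-squares slope for the no-intercept model y = b*x
b = (x @ y) / (x @ x)
ssr = np.sum((y - b * x) ** 2)  # residual sum of squares

# Centered (standard) definition: baseline is the mean-only model
r2_centered = 1.0 - ssr / np.sum((y - y.mean()) ** 2)

# Uncentered definition: baseline is the model that always predicts zero
r2_uncentered = 1.0 - ssr / np.sum(y ** 2)

print("centered:  ", r2_centered)
print("uncentered:", r2_uncentered)
```

With data like this the centered value comes out negative, because the no-intercept fit is worse than simply predicting the mean. The uncentered value stays between 0 and 1, because b = 0 is always one of the candidate fits.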

See, for example, the issue that triggered the change in statsmodels to handle R^2 "correctly" in the no-constant regression: https://github.com/statsmodels/statsmodels/issues/785

Answer 2

The 0.2205 comes from a model that also has an intercept term; the 0.5328 value is the result if you remove the intercept.

Basically, one package (statsmodels, via sm.OLS(y, x)) is modeling y = bx, whereas the other (scipy.stats.linregress) helpfully assumes that you would also like an intercept term (i.e. y = a + bx). [Note: The advantage of this assumption is that otherwise you would have to take x and bind a column of ones to it every time you wanted to run a regression (or else you'd end up with a biased model)]

Check out this post for a longer discussion.

Good luck!

2 answers

Answer 1
3 2014-06-03 03:24:51

Answer 2
1 accepted 2014-06-03 00:44:43
