I'm trying to rewrite a forecasting model (in Stata) using Python (with pandas.stats.api.ols), and ran into an issue with linear regression: the coefficients and intercept computed by pandas do not match those from Stata.
Investigation suggests the root cause might be that the values of the independent variables are very large. I base this suspicion on the findings below:
1) I created a simple DataFrame in Python and ran linear regression with it:
from pandas.stats.api import ols
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50,21], "B": [20, 30, 10, 40, 50,98], "C": [32, 234, 23, 23, 31,21], "D":[12,28,12,98,51,87], "E": [1,8,12,9,12,91]})
ols(y=df['A'], x=df[['B','C', 'D', 'E']])
and the summary for the LR is:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>
Number of Observations: 6
Number of Degrees of Freedom: 5
R-squared: 0.4627
Adj R-squared: -1.6865
Rmse: 23.9493
F-stat (4, 1): 0.2153, p-value: 0.9026
Degrees of Freedom: model 4, resid 1
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
B 0.3212 1.1176 0.29 0.8218 -1.8693 2.5117
C -0.0488 0.1361 -0.36 0.7806 -0.3155 0.2178
D 0.1512 0.4893 0.31 0.8092 -0.8077 1.1101
E -0.4508 0.8268 -0.55 0.6822 -2.0713 1.1697
intercept 20.9222 23.6280 0.89 0.5386 -25.3887 67.2331
---------------------------------End of Summary---------------------------------
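(Side note for anyone reproducing this today: pandas.stats.api.ols has since been removed from pandas. The same fit can be sketched with plain numpy least squares — this is an assumption-free reproduction of the numbers above, not the original pandas code path:)

```python
import numpy as np
import pandas as pd

# Same toy data as above
df = pd.DataFrame({"A": [10, 20, 30, 40, 50, 21],
                   "B": [20, 30, 10, 40, 50, 98],
                   "C": [32, 234, 23, 23, 31, 21],
                   "D": [12, 28, 12, 98, 51, 87],
                   "E": [1, 8, 12, 9, 12, 91]})

# Design matrix with an explicit intercept column
X = np.column_stack([np.ones(len(df)),
                     df[["B", "C", "D", "E"]].to_numpy(dtype=float)])
beta, *_ = np.linalg.lstsq(X, df["A"].to_numpy(dtype=float), rcond=None)
print(beta)  # [intercept, B, C, D, E] ~ [20.9222, 0.3212, -0.0488, 0.1512, -0.4508]
```

On this well-conditioned data, every implementation agrees.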
I saved this DataFrame to a Stata .dta file, and ran LR in Stata as:
use "/tmp/lr.dta", clear
reg A B C D E
The result is the same:
Source | SS df MS Number of obs = 6
-------------+------------------------------ F( 4, 1) = 0.22
Model | 493.929019 4 123.482255 Prob > F = 0.9026
Residual | 573.570981 1 573.570981 R-squared = 0.4627
-------------+------------------------------ Adj R-squared = -1.6865
Total | 1067.5 5 213.5 Root MSE = 23.949
------------------------------------------------------------------------------
A | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
B | .3211939 1.117591 0.29 0.822 -13.87914 14.52153
C | -.0488429 .1360552 -0.36 0.781 -1.777589 1.679903
D | .1512067 .4892539 0.31 0.809 -6.065353 6.367766
E | -.4508122 .8267897 -0.55 0.682 -10.95617 10.05455
_cons | 20.9222 23.62799 0.89 0.539 -279.2998 321.1442
------------------------------------------------------------------------------
I tried this in R, and got the same result.
2) However, when I increased the values of the independent variables in Python:
df = pd.DataFrame({"A": [10.0,20.0,30.0,40.0,50.0,21.0]})
df['B'] = pow(df['A'], 30)
df['C'] = pow(df['A'], 5)
df['D'] = pow(df['A'], 15)
df['E'] = pow(df['A'], 25)
I've made sure all the columns are using float64 here:
df.dtypes
A    float64
B    float64
C    float64
D    float64
E    float64
dtype: object
The result I got is:
-------------------------Summary of Regression Analysis-------------------------
Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>
Number of Observations: 6
Number of Degrees of Freedom: 2
R-squared: -0.7223
Adj R-squared: -1.1528
Rmse: 21.4390
F-stat (4, 4): -1.6775, p-value: 1.0000
Degrees of Freedom: model 1, resid 4
-----------------------Summary of Estimated Coefficients------------------------
Variable Coef Std Err t-stat p-value CI 2.5% CI 97.5%
--------------------------------------------------------------------------------
B -0.0000 0.0000 -0.00 0.9973 -0.0000 0.0000
C 0.0000 0.0000 0.00 1.0000 -0.0000 0.0000
D 0.0000 0.0000 0.00 1.0000 -0.0000 0.0000
E 0.0000 0.0000 0.00 0.9975 -0.0000 0.0000
intercept 0.0000 21.7485 0.00 1.0000 -42.6271 42.6271
---------------------------------End of Summary---------------------------------
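(Side note: the design matrix in this second experiment is astronomically ill-conditioned — exactly the regime where implementations can diverge. A quick check, sketched with numpy:)

```python
import numpy as np

A = np.array([10., 20., 30., 40., 50., 21.])
# Regressors B, C, D, E as constructed above, plus an intercept column
M = np.column_stack([np.ones_like(A), A**30, A**5, A**15, A**25])

cond = np.linalg.cond(M)
print(cond)  # far beyond 1/eps ~ 4.5e15, so float64 cannot resolve all columns at once
```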
But in Stata, I got a very different result:
Source | SS df MS Number of obs = 6
-------------+------------------------------ F( 4, 1) = 237.35
Model | 1066.37679 4 266.594196 Prob > F = 0.0486
Residual | 1.1232144 1 1.1232144 R-squared = 0.9989
-------------+------------------------------ Adj R-squared = 0.9947
Total | 1067.5 5 213.5 Root MSE = 1.0598
------------------------------------------------------------------------------
A | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
B | -1.45e-45 2.32e-46 -6.24 0.101 -4.40e-45 1.50e-45
C | 2.94e-06 3.67e-07 8.01 0.079 -1.72e-06 7.61e-06
D | -3.86e-21 6.11e-22 -6.31 0.100 -1.16e-20 3.90e-21
E | 4.92e-37 7.88e-38 6.24 0.101 -5.09e-37 1.49e-36
_cons | 9.881129 1.07512 9.19 0.069 -3.779564 23.54182
------------------------------------------------------------------------------
And the result in R aligns with Stata:
lm(formula = A ~ B + C + D + E, data = stata)
Residuals:
1 2 3 4 5 6
-1.757e-01 8.211e-01 1.287e-03 -1.269e-06 1.289e-09 -6.467e-01
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 9.881e+00 1.075e+00 9.191 0.069 .
B -1.449e-45 2.322e-46 -6.238 0.101
C 2.945e-06 3.674e-07 8.015 0.079 .
D -3.855e-21 6.106e-22 -6.313 0.100
E 4.919e-37 7.879e-38 6.243 0.101
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 1.06 on 1 degrees of freedom
Multiple R-squared: 0.9989, Adjusted R-squared: 0.9947
F-statistic: 237.3 on 4 and 1 DF, p-value: 0.04864
Hence, it appears to me that pandas has some issue here. Could anyone please advise?
I think this is because of the relative (floating-point) precision issue in Python — and not just in Python, but in most other programming languages as well, like C++. np.finfo(float).eps gives 2.2204460492503131e-16, so everything less than eps * max_value_of_your_data will be treated essentially as 0 when you try any primitive operations like + - * /. For example, 1e117 + 1e100 == 1e117 returns True, because 1e100 / 1e117 = 1e-17 < eps. Now look at your data.
# your data
# =======================
print(df)
A B C D E
0 10 1.0000e+30 100000 1.0000e+15 1.0000e+25
1 20 1.0737e+39 3200000 3.2768e+19 3.3554e+32
2 30 2.0589e+44 24300000 1.4349e+22 8.4729e+36
3 40 1.1529e+48 102400000 1.0737e+24 1.1259e+40
4 50 9.3132e+50 312500000 3.0518e+25 2.9802e+42
5 21 4.6407e+39 4084101 6.8122e+19 1.1363e+33
When relative precision is taken into account,
# ===================================================
import numpy as np
np.finfo(float).eps # 2.2204460492503131e-16
df[df < df.max().max()*np.finfo(float).eps] = 0
df
A B C D E
0 0 0.0000e+00 0 0 0.0000e+00
1 0 1.0737e+39 0 0 0.0000e+00
2 0 2.0589e+44 0 0 8.4729e+36
3 0 1.1529e+48 0 0 1.1259e+40
4 0 9.3132e+50 0 0 2.9802e+42
5 0 4.6407e+39 0 0 0.0000e+00
So after this thresholding there is no variation at all in y (column A), and that's why the regression returns all-zero coefficients. As a reminder, it's always good practice to normalize your data before running a regression.
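To illustrate that last point, here is one possible normalization (a sketch, not the only choice): scale each regressor by its column maximum so every column lies in [0, 1], fit by ordinary least squares, then map the coefficients back to the original units. With scaled columns, numpy recovers essentially the Stata/R answer.

```python
import numpy as np

A = np.array([10., 20., 30., 40., 50., 21.])
X = np.column_stack([A**30, A**5, A**15, A**25])   # columns B, C, D, E

scale = np.abs(X).max(axis=0)
M = np.column_stack([np.ones_like(A), X / scale])  # intercept + scaled regressors

beta, *_ = np.linalg.lstsq(M, A, rcond=None)
coefs = beta[1:] / scale                           # back to the original units
resid = A - M @ beta
r2 = 1 - resid @ resid / ((A - A.mean()) @ (A - A.mean()))
print(beta[0], r2)  # intercept ~ 9.88, R-squared ~ 0.9989, matching Stata/R
```

Max-abs scaling (rather than z-scoring) avoids subtracting huge column means, which would itself lose precision here.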