简体   繁体   中英

Linear regression fails in Python with large values in dependent variables

I'm trying to rewrite a forecasting model (in Stata) using Python (with pandas.stats.api.ols), and ran into an issue with linear regression: the coefficients and intercept computed by pandas do not match with those from Stata.

Investigation shows that the root cause might be the values of the dependent values are very big. I've this suspicion based on the findings below:

1) I created a simple DataFrame in Python and ran linear regression with it:

from pandas.stats.api import ols
import pandas as pd
df = pd.DataFrame({"A": [10,20,30,40,50,21], "B": [20, 30, 10, 40, 50,98], "C": [32, 234, 23, 23, 31,21], "D":[12,28,12,98,51,87], "E": [1,8,12,9,12,91]})
ols(y=df['A'], x=df[['B','C', 'D', 'E']])

and the summary for the LR is:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>

Number of Observations:         6
Number of Degrees of Freedom:   5

R-squared:         0.4627
Adj R-squared:    -1.6865

Rmse:             23.9493

F-stat (4, 1):     0.2153, p-value:     0.9026

Degrees of Freedom: model 4, resid 1

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B     0.3212     1.1176       0.29     0.8218    -1.8693     2.5117
             C    -0.0488     0.1361      -0.36     0.7806    -0.3155     0.2178
             D     0.1512     0.4893       0.31     0.8092    -0.8077     1.1101
             E    -0.4508     0.8268      -0.55     0.6822    -2.0713     1.1697
     intercept    20.9222    23.6280       0.89     0.5386   -25.3887    67.2331
---------------------------------End of Summary---------------------------------

I saved this DataFrame to a Stata .dta file, and ran LR in Stata as:

 use "/tmp/lr.dta", clear
 reg A B C D E

The result is the same:

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =    0.22
       Model |  493.929019     4  123.482255           Prob > F      =  0.9026
    Residual |  573.570981     1  573.570981           R-squared     =  0.4627
-------------+------------------------------           Adj R-squared = -1.6865
       Total |      1067.5     5       213.5           Root MSE      =  23.949

------------------------------------------------------------------------------
           A |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           B |   .3211939   1.117591     0.29   0.822    -13.87914    14.52153
           C |  -.0488429   .1360552    -0.36   0.781    -1.777589    1.679903
           D |   .1512067   .4892539     0.31   0.809    -6.065353    6.367766
           E |  -.4508122   .8267897    -0.55   0.682    -10.95617    10.05455
       _cons |    20.9222   23.62799     0.89   0.539    -279.2998    321.1442
------------------------------------------------------------------------------

I tried this in R, and got the same result.

2) However, if I increased the values of the dependent variables in Python:

df = pd.DataFrame({"A": [10.0,20.0,30.0,40.0,50.0,21.0]})
df['B'] = pow(df['A'], 30)
df['C'] = pow(df['A'], 5)
df['D'] = pow(df['A'], 15)
df['E'] = pow(df['A'], 25)

I've made sure all the columns are using float64 here: df.dtypes A float64 B float64 C float64 D float64 E float64 dtype: object

The result I got is:

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <B> + <C> + <D> + <E> + <intercept>

Number of Observations:         6
Number of Degrees of Freedom:   2

R-squared:        -0.7223
Adj R-squared:    -1.1528

Rmse:             21.4390

F-stat (4, 4):    -1.6775, p-value:     1.0000

Degrees of Freedom: model 1, resid 4

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
             B    -0.0000     0.0000      -0.00     0.9973    -0.0000     0.0000
             C     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             D     0.0000     0.0000       0.00     1.0000    -0.0000     0.0000
             E     0.0000     0.0000       0.00     0.9975    -0.0000     0.0000
     intercept     0.0000    21.7485       0.00     1.0000   -42.6271    42.6271
---------------------------------End of Summary---------------------------------

But in Stata, I got a very different result:

      Source |       SS       df       MS              Number of obs =       6
-------------+------------------------------           F(  4,     1) =  237.35
       Model |  1066.37679     4  266.594196           Prob > F      =  0.0486
    Residual |   1.1232144     1   1.1232144           R-squared     =  0.9989
-------------+------------------------------           Adj R-squared =  0.9947
       Total |      1067.5     5       213.5           Root MSE      =  1.0598

------------------------------------------------------------------------------
           A |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           B |  -1.45e-45   2.32e-46    -6.24   0.101    -4.40e-45    1.50e-45
           C |   2.94e-06   3.67e-07     8.01   0.079    -1.72e-06    7.61e-06
           D |  -3.86e-21   6.11e-22    -6.31   0.100    -1.16e-20    3.90e-21
           E |   4.92e-37   7.88e-38     6.24   0.101    -5.09e-37    1.49e-36
       _cons |   9.881129    1.07512     9.19   0.069    -3.779564    23.54182
------------------------------------------------------------------------------

And the result in R aligns with Stata: lm(formula = A ~ B + C + D + E, data = stata)

Residuals:
         1          2          3          4          5          6 
-1.757e-01  8.211e-01  1.287e-03 -1.269e-06  1.289e-09 -6.467e-01 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)  9.881e+00  1.075e+00   9.191    0.069 .
B           -1.449e-45  2.322e-46  -6.238    0.101  
C            2.945e-06  3.674e-07   8.015    0.079 .
D           -3.855e-21  6.106e-22  -6.313    0.100  
E            4.919e-37  7.879e-38   6.243    0.101  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.06 on 1 degrees of freedom
Multiple R-squared:  0.9989,  Adjusted R-squared:  0.9947 
F-statistic: 237.3 on 4 and 1 DF,  p-value: 0.04864

Hence, it appears to me that pandas is having some issue here. Could anyone please help advise?

I think this is because of the relative precision issue in python (not just in python, but most other programming languages as well, like C++). np.finfo(float).eps gives 2.2204460492503131e-16 , so everything less than eps*max_value_of_your_data will be treated essentially as 0 when you try any primitive operations like + - * / . For example, 1e117 + 1e100 == 1e117 returns True , because 1e100/1e117 = 1e-17 < eps . Now look at your data.

# your data
# =======================
print(df)

    A           B          C           D           E
0  10  1.0000e+30     100000  1.0000e+15  1.0000e+25
1  20  1.0737e+39    3200000  3.2768e+19  3.3554e+32
2  30  2.0589e+44   24300000  1.4349e+22  8.4729e+36
3  40  1.1529e+48  102400000  1.0737e+24  1.1259e+40
4  50  9.3132e+50  312500000  3.0518e+25  2.9802e+42
5  21  4.6407e+39    4084101  6.8122e+19  1.1363e+33

When relative precision is taken into account,

# ===================================================
import numpy as np

np.finfo(float).eps # 2.2204460492503131e-16

df[df < df.max().max()*np.finfo(float).eps] = 0
df

   A           B  C  D           E
0  0  0.0000e+00  0  0  0.0000e+00
1  0  1.0737e+39  0  0  0.0000e+00
2  0  2.0589e+44  0  0  8.4729e+36
3  0  1.1529e+48  0  0  1.1259e+40
4  0  9.3132e+50  0  0  2.9802e+42
5  0  4.6407e+39  0  0  0.0000e+00

So there is no variation at all in y(A) , and that's why statsmodels returns all 0 coefficients. As a reminder, it's always a good practice to normalize your data first before running regression.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM