
How to construct cubic regression using statsmodels.api?

I have been trying to fit a cubic regression for a few days now, but I keep running into the same problem: my results do not match those produced by the R code I wrote as a check. The data sets are identical, so that is not the issue. The code I have right now looks like this:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
import statsmodels.api as sm
import numpy as np

df = pd.read_csv("http://web.stanford.edu/~oleg2/hse/boston/boston_house_prices.csv")
df = df.dropna()
x, y = np.array(df.lstat), np.array(df.crim)
polynomial_features= PolynomialFeatures(degree=3)
xp = polynomial_features.fit_transform(x.reshape(-1,1))
model = sm.OLS(y, xp).fit()
print(model.summary())

I have also made something like this:

import pandas as pd
import statsmodels.api as sm
import numpy as np
import statsmodels.formula.api as smf

df = pd.read_csv("http://web.stanford.edu/~oleg2/hse/boston/boston_house_prices.csv")
df = df.dropna()
ft1 = smf.ols(formula="crim ~ lstat + I(np.power(lstat,2)) + I(np.power(lstat,3))", data=df).fit()
print(ft1.summary())

These two approaches give exactly the same result:

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   crim   R-squared:                       0.218
Model:                            OLS   Adj. R-squared:                  0.213
Method:                 Least Squares   F-statistic:                     46.63
Date:                Sat, 03 Oct 2020   Prob (F-statistic):           1.35e-26
Time:                        10:26:13   Log-Likelihood:                -1744.2
No. Observations:                 506   AIC:                             3496.
Df Residuals:                     502   BIC:                             3513.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=========================================================================================
                            coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------------
Intercept                 1.2010      2.029      0.592      0.554      -2.785       5.187
lstat                    -0.4491      0.465     -0.966      0.335      -1.362       0.464
I(np.power(lstat, 2))     0.0558      0.030      1.852      0.065      -0.003       0.115
I(np.power(lstat, 3))    -0.0009      0.001     -1.517      0.130      -0.002       0.000
==============================================================================
Omnibus:                      607.734   Durbin-Watson:                   1.239
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            53621.219
Skew:                           5.726   Prob(JB):                         0.00
Kurtosis:                      52.114   Cond. No.                     5.20e+04
==============================================================================

And here is the R code:

fit.lstat2 <- lm(crim ~ poly(lstat, 3))
summary(fit.lstat2)

And it gives the following result:

## Call:
## lm(formula = crim ~ poly(lstat, 3))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.234  -2.151  -0.486   0.066  83.353 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       3.6135     0.3392  10.654   <2e-16 ***
## poly(lstat, 3)1  88.0697     7.6294  11.543   <2e-16 ***
## poly(lstat, 3)2  15.8882     7.6294   2.082   0.0378 *  
## poly(lstat, 3)3 -11.5740     7.6294  -1.517   0.1299    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.629 on 502 degrees of freedom
## Multiple R-squared:  0.2179, Adjusted R-squared:  0.2133 
## F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16

The R result is the correct one, but I don't understand why Python gives a different answer. Should I be using a different approach?

P.S. I have verified that both code snippets run, including the database link, so feel free to run them and see the results for yourself.

I am not an expert in R, but I believe the difference is that R's poly() uses orthogonal polynomials by default, while your Python code uses raw powers. To make R match, set raw=TRUE.

I get the same result as Python's statsmodels.api when I use raw=TRUE on the R side:

fit.lstat2 <- lm(crim ~ poly(lstat, 3, raw=TRUE))
summary(fit.lstat2)

Result:

Call:
lm(formula = crim ~ poly(lstat, 3, raw = TRUE))

Residuals:
    Min      1Q  Median      3Q     Max 
-15.234  -2.151  -0.486   0.066  83.353 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)  
(Intercept)                  1.2009656  2.0286452   0.592   0.5541  
poly(lstat, 3, raw = TRUE)1 -0.4490656  0.4648911  -0.966   0.3345  
poly(lstat, 3, raw = TRUE)2  0.0557794  0.0301156   1.852   0.0646 .
poly(lstat, 3, raw = TRUE)3 -0.0008574  0.0005652  -1.517   0.1299  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.629 on 502 degrees of freedom
Multiple R-squared:  0.2179,    Adjusted R-squared:  0.2133 
F-statistic: 46.63 on 3 and 502 DF,  p-value: < 2.2e-16
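Going the other direction, if you want Python to reproduce R's default orthogonal-polynomial fit, you can build an orthogonal basis by QR-decomposing the Vandermonde matrix, which is essentially what R's poly() does internally. Below is a sketch; the helper name ortho_poly is my own, and the column scaling and signs may differ from R's normalization, which changes the coefficient values but not the t-statistics, p-values, R-squared, or F-statistic:

```python
import numpy as np

def ortho_poly(x, degree):
    """Orthogonal polynomial basis via QR of the Vandermonde matrix
    (hypothetical helper, mimicking what R's poly() does internally)."""
    x = np.asarray(x, dtype=float)
    # Vandermonde matrix with columns [1, x, x^2, ..., x^degree]
    X = np.vander(x, degree + 1, increasing=True)
    # QR decomposition orthogonalizes the columns against each other
    Q, _ = np.linalg.qr(X)
    # Drop the constant column; the remaining columns are the
    # orthogonal polynomial terms of degree 1..degree
    return Q[:, 1:]
```

With this basis, something like `sm.OLS(y, sm.add_constant(ortho_poly(x, 3))).fit()` should reproduce the t-statistics and p-values R reports for `poly(lstat, 3)`, up to sign and scale of the individual coefficients.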
