Repeated columns of a single variable when using the statsmodels.formula.api ols function in Python
I am trying to run a multiple linear regression in Python using the statsmodels.formula.api package. The code I used for the regression is listed below.
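import pandas as pd
import statsmodels.formula.api as smf

auto_1 = pd.read_csv("Auto.csv")
# Use every column except the first (mpg) and the last (name) as predictors
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())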
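The data contains the following variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin, and name. When the results are printed, the horsepower column shows up on many separate rows, and the regression results are also incorrect. I am not sure why.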
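This is probably caused by the data type of the horsepower column. If its values are categories or plain strings, the model applies treatment (dummy) coding to them by default, which produces the result you are seeing. Check the data types (run auto_1.dtypes) and convert the column to a numeric type, ideally when the CSV file is first read, using the dtype= parameter of read_csv().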
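As a minimal sketch of that check: the snippet below reads a small made-up CSV from an in-memory string instead of Auto.csv, and it assumes the column contains a non-numeric placeholder such as "?" for a missing value (an assumption, since the question does not show the raw file). It inspects the dtypes, then shows two ways to end up with a numeric column.

```python
import io

import pandas as pd

# Hypothetical stand-in for Auto.csv; the "?" placeholder is an assumption.
csv_text = """mpg,horsepower,weight
18.0,130,3504
15.0,165,3693
25.0,?,2046
24.0,95,2372
"""

raw = pd.read_csv(io.StringIO(csv_text))
# horsepower comes back with a non-numeric (string) dtype because of the "?"
print(raw.dtypes)
was_numeric = pd.api.types.is_numeric_dtype(raw["horsepower"])

# Option 1: declare the placeholder as missing while reading the file.
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(clean.dtypes)

# Option 2: coerce after loading; entries that cannot be parsed become NaN.
raw["horsepower"] = pd.to_numeric(raw["horsepower"], errors="coerce")
print(raw.dtypes)
```

If the column really does contain such a placeholder, passing dtype= on its own would raise an error, so na_values or pd.to_numeric(..., errors="coerce") may be the more forgiving route. Rows with NaN are then a separate decision (drop or impute) before fitting.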
Here is an example in which a column with numeric values is cast (i.e. converted) to strings (or categories):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame(
    {
        'mpg': np.random.randint(20, 40, 50),
        'horsepower': np.random.randint(100, 200, 50)
    }
)

# convert integers to strings (or categories)
df['horsepower'] = (
    df['horsepower'].astype('str')  # same result with .astype('category')
)

formula = 'mpg ~ horsepower'
results = smf.ols(formula, df).fit()
print(results.summary())
Output (dummy coding):
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.778
Model:                            OLS   Adj. R-squared:                 -0.207
Method:                 Least Squares   F-statistic:                    0.7901
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.715
Time:                        20:17:51   Log-Likelihood:                -110.27
No. Observations:                  50   AIC:                             302.5
Df Residuals:                       9   BIC:                             380.9
Df Model:                          40
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            32.0000      5.175      6.184      0.000      20.294      43.706
horsepower[T.103]    -4.0000      7.318     -0.547      0.598     -20.555      12.555
horsepower[T.112]    -1.0000      7.318     -0.137      0.894     -17.555      15.555
horsepower[T.116]    -9.0000      7.318     -1.230      0.250     -25.555       7.555
horsepower[T.117]     6.0000      7.318      0.820      0.433     -10.555      22.555
horsepower[T.118]     2.0000      7.318      0.273      0.791     -14.555      18.555
horsepower[T.120]    -4.0000      6.338     -0.631      0.544     -18.337      10.337
etc.
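The row count in the summary above can also be checked programmatically. The following is only a sketch on freshly generated data (the frame name demo and the fixed seed are illustrative choices, not part of the original answer): with a string-typed predictor, the fitted model has one coefficient per distinct value, counting the reference level as the intercept.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
demo = pd.DataFrame({
    "mpg": rng.integers(20, 40, 50),
    # strings on purpose, to trigger dummy coding
    "horsepower": rng.integers(100, 200, 50).astype(str),
})

fit = smf.ols("mpg ~ horsepower", data=demo).fit()

# Intercept (reference level) plus one dummy for every other distinct value
print(len(fit.params), demo["horsepower"].nunique())

# The design-matrix column names reveal the dummy coding
print(fit.model.exog_names[:5])
```

Seeing names of the form horsepower[T.<value>] in exog_names is a quick sign that the column was treated as categorical rather than numeric.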
Now, convert the strings back to integers:
df['horsepower'] = pd.to_numeric(df.horsepower)
# or df['horsepower'] = df['horsepower'].astype('int')
results = smf.ols(formula, df).fit()
print(results.summary())
Output (as expected):
                            OLS Regression Results
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                    0.5388
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.466
Time:                        20:24:54   Log-Likelihood:                -147.65
No. Observations:                  50   AIC:                             299.3
Df Residuals:                      48   BIC:                             303.1
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     31.7638      3.663      8.671      0.000      24.398      39.129
horsepower    -0.0176      0.024     -0.734      0.466      -0.066       0.031
==============================================================================
Omnibus:                        3.529   Durbin-Watson:                   1.859
Prob(Omnibus):                  0.171   Jarque-Bera (JB):                1.725
Skew:                           0.068   Prob(JB):                        0.422
Kurtosis:                       2.100   Cond. No.                         834.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
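Since the question builds its formula by joining column names, one optional safeguard is to join only the numeric columns. This is a sketch on a tiny made-up frame (cars and its values are hypothetical), not something the original answer prescribes:

```python
import pandas as pd

# Hypothetical frame standing in for the Auto data
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 25.0, 24.0],
    "horsepower": [130, 165, 88, 95],
    "weight": [3504, 3693, 2046, 2372],
    "name": ["a", "b", "c", "d"],
})

response = "mpg"
# Keep numeric columns only, then drop the response from the predictor list
numeric_cols = cars.select_dtypes(include="number").columns
predictors = [col for col in numeric_cols if col != response]

formula = f"{response} ~ " + " + ".join(predictors)
print(formula)  # mpg ~ horsepower + weight
```

A predictor that silently drops out of the formula here is a hint that it was read as strings and needs converting first.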