Repeated rows for a single variable when using the ols function from the statsmodels.formula.api package in Python
I am trying to perform multiple linear regression using the statsmodels.formula.api package in Python. The code I used to run the regression is listed below.
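import pandas as pd
import statsmodels.formula.api as smf
auto_1 = pd.read_csv("Auto.csv")
# regress mpg on every column except the first (mpg) and the last (name)
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())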
The data consists of the following variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin and name. When the summary is printed, it shows multiple rows for the horsepower column, and the regression results are also not correct. I'm not sure why.
It's likely because of the data type of the horsepower column. If its values are categories or just strings, the model will use treatment (dummy) coding for them by default, producing the results you are seeing. Check the data type (run auto_1.dtypes) and cast the column to a numeric type. It's best to do this when you first read the CSV file, using the dtype= parameter of the read_csv() method.
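As a minimal sketch of that check, the snippet below uses a small, made-up stand-in for Auto.csv. It assumes the column is read as text because of a placeholder such as "?" in one row; that placeholder is an assumption for illustration, not something confirmed by the question. If the real file contains such entries, passing dtype= alone would raise an error, so the sketch uses na_values= or pd.to_numeric(..., errors="coerce") instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for Auto.csv: one horsepower entry is the placeholder "?".
csv_text = """mpg,horsepower,weight
18.0,130,3504
25.0,?,2046
24.0,95,2372
"""

# Read naively: the "?" causes the whole horsepower column to be read as strings.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# List the columns that are not numeric; a formula would dummy-code these.
non_numeric = raw.select_dtypes(exclude="number").columns.tolist()
print(non_numeric)

# Option 1: declare the placeholder as a missing value while reading.
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(clean.dtypes)

# Option 2: coerce after loading; entries that cannot be parsed become NaN.
raw["horsepower"] = pd.to_numeric(raw["horsepower"], errors="coerce")
print(raw.dtypes)
```

Either option leaves missing values (NaN) in the affected rows. As far as I know, the statsmodels formula interface drops such rows by default, but it is worth confirming this against the number of observations reported in the summary.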
Here is an example where a column with numeric values is cast (i.e. converted) to strings (or categories):
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
df = pd.DataFrame(
{
'mpg': np.random.randint(20, 40, 50),
'horsepower': np.random.randint(100, 200, 50)
}
)
# convert integers to strings (or categories)
df['horsepower'] = (
df['horsepower'].astype('str') # same result with .astype('category')
)
formula = 'mpg ~ horsepower'
results = smf.ols(formula, df).fit()
print(results.summary())
Output (dummy coding):
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.778
Model: OLS Adj. R-squared: -0.207
Method: Least Squares F-statistic: 0.7901
Date: Sun, 18 Sep 2022 Prob (F-statistic): 0.715
Time: 20:17:51 Log-Likelihood: -110.27
No. Observations: 50 AIC: 302.5
Df Residuals: 9 BIC: 380.9
Df Model: 40
Covariance Type: nonrobust
=====================================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------------
Intercept 32.0000 5.175 6.184 0.000 20.294 43.706
horsepower[T.103] -4.0000 7.318 -0.547 0.598 -20.555 12.555
horsepower[T.112] -1.0000 7.318 -0.137 0.894 -17.555 15.555
horsepower[T.116] -9.0000 7.318 -1.230 0.250 -25.555 7.555
horsepower[T.117] 6.0000 7.318 0.820 0.433 -10.555 22.555
horsepower[T.118] 2.0000 7.318 0.273 0.791 -14.555 18.555
horsepower[T.120] -4.0000 6.338 -0.631 0.544 -18.337 10.337
etc.
Now, converting the strings back to integers:
df['horsepower'] = pd.to_numeric(df.horsepower)
# or df['horsepower'] = df['horsepower'].astype('int')
results = smf.ols(formula, df).fit()
print(results.summary())
Output (as expected):
OLS Regression Results
==============================================================================
Dep. Variable: mpg R-squared: 0.011
Model: OLS Adj. R-squared: -0.010
Method: Least Squares F-statistic: 0.5388
Date: Sun, 18 Sep 2022 Prob (F-statistic): 0.466
Time: 20:24:54 Log-Likelihood: -147.65
No. Observations: 50 AIC: 299.3
Df Residuals: 48 BIC: 303.1
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 31.7638 3.663 8.671 0.000 24.398 39.129
horsepower -0.0176 0.024 -0.734 0.466 -0.066 0.031
==============================================================================
Omnibus: 3.529 Durbin-Watson: 1.859
Prob(Omnibus): 0.171 Jarque-Bera (JB): 1.725
Skew: 0.068 Prob(JB): 0.422
Kurtosis: 2.100 Cond. No. 834.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.