Repeated columns for a single variable when using ols from the statsmodels.formula.api package

Question

I am trying to perform multiple linear regression using the statsmodels.formula.api package in Python. The code I used is listed below.

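import pandas as pd
import statsmodels.formula.api as smf

auto_1 = pd.read_csv("Auto.csv")
# regress mpg on every column except the first (mpg) and the last (name)
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())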

The data consists of the following variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin and name. When the summary is printed, it shows multiple rows for the horsepower column, and the regression results are not correct either. I'm not sure why.

(screenshot of repeated rows)

Answer 1

It's likely because of the data type of the horsepower column. If its values are categories or just strings, the model will use treatment (dummy) coding for them by default, producing the results you are seeing. Check the data types (run auto_1.dtypes) and cast the column to a numeric type. It's best to do this when you first read the CSV file, using the dtype= parameter of pd.read_csv().
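A minimal sketch of that check, using a small in-memory CSV rather than the asker's Auto.csv. The "?" placeholder is an assumption made for illustration; the question does not show which non-numeric values the column actually contains.

```python
import io

import pandas as pd

# Hypothetical stand-in for Auto.csv; the "?" entry is an assumed placeholder.
csv_text = "mpg,horsepower\n18.0,130\n15.0,165\n25.0,?\n24.0,95\n"

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)  # horsepower is read as strings, not as a numeric dtype

# List the entries that cannot be parsed as numbers.
parsed = pd.to_numeric(raw["horsepower"], errors="coerce")
bad_values = raw.loc[parsed.isna(), "horsepower"].unique().tolist()
print(bad_values)

# Option 1: declare the placeholder as missing while reading the file.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["?"])
print(clean.dtypes)

# Option 2: coerce after reading; unparseable entries become NaN.
raw["horsepower"] = parsed
```

Passing dtype= alone is only enough when every entry in the column parses as a number, so it may help to look at the offending values first and then decide whether to treat them as missing.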

Here is an example where a column with numeric values is cast (i.e. converted) to strings (or categories):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame(
    {
        'mpg': np.random.randint(20, 40, 50),
        'horsepower': np.random.randint(100, 200, 50)
    }
)
# convert integers to strings (or categories)
df['horsepower'] = (
    df['horsepower'].astype('str')  # same result with .astype('category')
)

formula = 'mpg ~ horsepower'

results = smf.ols(formula, df).fit()
print(results.summary())

Output (dummy coding):

OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.778
Model:                            OLS   Adj. R-squared:                 -0.207
Method:                 Least Squares   F-statistic:                    0.7901
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.715
Time:                        20:17:51   Log-Likelihood:                -110.27
No. Observations:                  50   AIC:                             302.5
Df Residuals:                       9   BIC:                             380.9
Df Model:                          40                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            32.0000      5.175      6.184      0.000      20.294      43.706
horsepower[T.103]    -4.0000      7.318     -0.547      0.598     -20.555      12.555
horsepower[T.112]    -1.0000      7.318     -0.137      0.894     -17.555      15.555
horsepower[T.116]    -9.0000      7.318     -1.230      0.250     -25.555       7.555
horsepower[T.117]     6.0000      7.318      0.820      0.433     -10.555      22.555
horsepower[T.118]     2.0000      7.318      0.273      0.791     -14.555      18.555
horsepower[T.120]    -4.0000      6.338     -0.631      0.544     -18.337      10.337

etc.
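The "Df Model: 40" line in the summary above follows from this coding: with an intercept, a string column with k distinct values contributes k - 1 dummy columns, one per level except the reference level. The sketch below checks that relationship on synthetic data; the seed and sizes are arbitrary choices, not taken from the original answer.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability only
df = pd.DataFrame(
    {
        "mpg": rng.integers(20, 40, 50),
        # numeric values stored as strings, as in the example above
        "horsepower": rng.integers(100, 200, 50).astype(str),
    }
)

results = smf.ols("mpg ~ horsepower", df).fit()

n_levels = df["horsepower"].nunique()
# One coefficient per level except the reference level, plus the intercept.
print(n_levels, len(results.params), results.df_model)
```

A model degrees-of-freedom count close to the number of observations, as here, is a quick hint that a predictor is being treated as categorical.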

Now, converting the strings back to integers:

df['horsepower'] = pd.to_numeric(df.horsepower)
# or df['horsepower'] = df['horsepower'].astype('int')

results = smf.ols(formula, df).fit()
print(results.summary())

Output (as expected):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                    0.5388
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.466
Time:                        20:24:54   Log-Likelihood:                -147.65
No. Observations:                  50   AIC:                             299.3
Df Residuals:                      48   BIC:                             303.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     31.7638      3.663      8.671      0.000      24.398      39.129
horsepower    -0.0176      0.024     -0.734      0.466      -0.066       0.031
==============================================================================
Omnibus:                        3.529   Durbin-Watson:                   1.859
Prob(Omnibus):                  0.171   Jarque-Bera (JB):                1.725
Skew:                           0.068   Prob(JB):                        0.422
Kurtosis:                       2.100   Cond. No.                         834.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
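Applied to the formula in the question, one way to guard against this is to build the right-hand side only from columns that pandas reports as numeric, and to print whatever was skipped. The helper name numeric_formula and the tiny frame below are hypothetical, for illustration only.

```python
import pandas as pd


def numeric_formula(data: pd.DataFrame, response: str) -> str:
    """Build 'response ~ x1 + x2 + ...' from the numeric columns only."""
    predictors = [
        col
        for col in data.select_dtypes(include="number").columns
        if col != response
    ]
    return f"{response} ~ " + " + ".join(predictors)


# Hypothetical miniature of the Auto data; horsepower is stored as strings.
auto = pd.DataFrame(
    {
        "mpg": [18.0, 15.0, 25.0],
        "cylinders": [8, 8, 4],
        "horsepower": ["130", "165", "95"],
        "name": ["a", "b", "c"],
    }
)

numeric_cols = auto.select_dtypes(include="number").columns
skipped = [col for col in auto.columns if col not in numeric_cols]
before = numeric_formula(auto, "mpg")
print(before)
print(skipped)  # columns that would have been dummy coded

# After casting, horsepower enters the formula as a single numeric term.
auto["horsepower"] = pd.to_numeric(auto["horsepower"])
after = numeric_formula(auto, "mpg")
print(after)
```

Filtering silently would hide the problem, so the skipped list is printed; a column that shows up there unexpectedly is the one to inspect and cast.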

Answer 1 (posted 2022-09-18 20:29:56)
