
Repeated columns of a single variable when using the statsmodels.formula.api ols function in Python

I am trying to perform multiple linear regression using the statsmodels.formula.api package in Python. The code I used for the regression is listed below.

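import pandas as pd
import statsmodels.formula.api as smf

auto_1 = pd.read_csv("Auto.csv")
# use every column except the first (mpg) and the last (name) as predictors
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())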

The data consists of the following variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin and name. When the summary is printed, it shows multiple rows for the horsepower column, and the regression results are not correct either. I'm not sure why.

[Screenshot of repeated rows]

It's likely because of the data type of the horsepower column. If its values are categories or plain strings, the model applies treatment (dummy) coding to them by default, which produces the results you are seeing. Check the data types (run auto_1.dtypes) and cast the column to a numeric type. It is best to do this when you first read the CSV file, using the dtype= parameter of the read_csv() function.
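As a rough sketch of that check, the snippet below reads a small made-up CSV instead of Auto.csv. It assumes the column became a string column because of a non-numeric placeholder ("?" here), which is only a guess about the real file. Note that in that situation passing dtype= alone would fail on the placeholder, so the sketch uses the na_values= parameter of read_csv(), or pd.to_numeric() with errors="coerce", instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for Auto.csv; the "?" entry is an assumed
# cause of the non-numeric dtype, not something taken from the question.
csv_text = """mpg,horsepower,weight
18.0,130,3504
25.0,?,2046
24.0,95,2372
"""

# Plain read: the "?" forces horsepower to a non-numeric (string) dtype
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)

# Option 1: declare the placeholder as missing while reading
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")
print(clean.dtypes)

# Option 2: coerce after reading; values that cannot be parsed become NaN
coerced = pd.to_numeric(raw["horsepower"], errors="coerce")
print(coerced)
```

Rows with a missing horsepower value are then dropped by the formula interface by default, or can be removed explicitly with dropna() before fitting.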

Here is an example where a column with numeric values is cast (i.e. converted) to strings (or categories):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame(
    {
        'mpg': np.random.randint(20, 40, 50),
        'horsepower': np.random.randint(100, 200, 50)
    }
)
# convert integers to strings (or categories)
df['horsepower'] = (
    df['horsepower'].astype('str')  # same result with .astype('category')
)

formula = 'mpg ~ horsepower'

results = smf.ols(formula, df).fit()
print(results.summary())

Output (dummy coding):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.778
Model:                            OLS   Adj. R-squared:                 -0.207
Method:                 Least Squares   F-statistic:                    0.7901
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.715
Time:                        20:17:51   Log-Likelihood:                -110.27
No. Observations:                  50   AIC:                             302.5
Df Residuals:                       9   BIC:                             380.9
Df Model:                          40                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            32.0000      5.175      6.184      0.000      20.294      43.706
horsepower[T.103]    -4.0000      7.318     -0.547      0.598     -20.555      12.555
horsepower[T.112]    -1.0000      7.318     -0.137      0.894     -17.555      15.555
horsepower[T.116]    -9.0000      7.318     -1.230      0.250     -25.555       7.555
horsepower[T.117]     6.0000      7.318      0.820      0.433     -10.555      22.555
horsepower[T.118]     2.0000      7.318      0.273      0.791     -14.555      18.555
horsepower[T.120]    -4.0000      6.338     -0.631      0.544     -18.337      10.337

etc.
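The number of horsepower[T. ...] rows follows directly from the coding: each distinct string value gets its own dummy column, except one reference level that is absorbed into the intercept. A minimal sketch of that count, using random data with an arbitrarily chosen seed (the names demo and fit are only for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)  # fixed seed, chosen arbitrarily
demo = pd.DataFrame(
    {
        "mpg": rng.integers(20, 40, 50),
        # string column, so the formula treats it as categorical
        "horsepower": rng.integers(100, 200, 50).astype(str),
    }
)

fit = smf.ols("mpg ~ horsepower", data=demo).fit()

n_levels = demo["horsepower"].nunique()
# Intercept + (n_levels - 1) dummy terms = n_levels coefficients in total
print(n_levels, len(fit.params))
```

This matches the summary above, where Df Model is 40: 41 distinct horsepower values, one of which serves as the reference level.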

Now, converting the strings back to integers:

df['horsepower'] = pd.to_numeric(df.horsepower)
# or df['horsepower'] = df['horsepower'].astype('int')

results = smf.ols(formula, df).fit()
print(results.summary())

Output (as expected):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                    0.5388
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.466
Time:                        20:24:54   Log-Likelihood:                -147.65
No. Observations:                  50   AIC:                             299.3
Df Residuals:                      48   BIC:                             303.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     31.7638      3.663      8.671      0.000      24.398      39.129
horsepower    -0.0176      0.024     -0.734      0.466      -0.066       0.031
==============================================================================
Omnibus:                        3.529   Durbin-Watson:                   1.859
Prob(Omnibus):                  0.171   Jarque-Bera (JB):                1.725
Skew:                           0.068   Prob(JB):                        0.422
Kurtosis:                       2.100   Cond. No.                         834.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
