
Repeated columns of a single variable when using the statsmodels.formula.api package ols function in Python

I am trying to run a multiple linear regression in Python using the statsmodels.formula.api package. The code I use for the regression is listed below.

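import pandas as pd
import statsmodels.formula.api as smf

auto_1 = pd.read_csv("Auto.csv")
# regress mpg on every column except the first (mpg) and the last (name)
formula = 'mpg ~ ' + " + ".join(auto_1.columns[1:-1])
results = smf.ols(formula, data=auto_1).fit()
print(results.summary())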

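The data contains the following variables: mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin, and name. When the results are printed, the horsepower column appears as many separate rows, and the regression results are also incorrect. I am not sure why.

[Screenshot of the repeated rows]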

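This is probably because of the data type of the horsepower column. If its values are categories or plain strings, the model uses treatment (dummy) coding for them by default, which produces the result you are seeing. Check the data types (run auto_1.dtypes) and convert the column to a numeric type, ideally when the csv file is first read, using the dtype= parameter of read_csv().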
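A minimal sketch of that check and conversion. The inline CSV is a hypothetical stand-in for Auto.csv, and the '?' placeholder for a missing value is an assumption about why the column is not being read as numeric; your file may differ.

```python
import io
import pandas as pd

# Hypothetical stand-in for Auto.csv; the '?' placeholder is an assumption.
csv_text = """mpg,horsepower,weight
18.0,130,3504
25.0,?,2046
15.0,165,3693
"""

raw = pd.read_csv(io.StringIO(csv_text))
print(raw.dtypes)  # horsepower is not numeric because of the '?' entry

# Option 1: declare the placeholder as missing when reading the file.
clean = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Option 2: coerce after reading; unparseable entries become NaN.
coerced = raw.assign(
    horsepower=pd.to_numeric(raw["horsepower"], errors="coerce")
)
print(clean.dtypes)
print(coerced.dtypes)
```

Either way the column ends up numeric with NaN where the value could not be parsed, so it is worth checking how many rows are affected before fitting.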

Here is an example in which a column with numeric values is cast (i.e. converted) to strings (or categories):

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame(
    {
        'mpg': np.random.randint(20, 40, 50),
        'horsepower': np.random.randint(100, 200, 50)
    }
)
# convert integers to strings (or categories)
df['horsepower'] = (
    df['horsepower'].astype('str')  # same result with .astype('category')
)

formula = 'mpg ~ horsepower'

results = smf.ols(formula, df).fit()
print(results.summary())

Output (dummy coding):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.778
Model:                            OLS   Adj. R-squared:                 -0.207
Method:                 Least Squares   F-statistic:                    0.7901
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.715
Time:                        20:17:51   Log-Likelihood:                -110.27
No. Observations:                  50   AIC:                             302.5
Df Residuals:                       9   BIC:                             380.9
Df Model:                          40                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept            32.0000      5.175      6.184      0.000      20.294      43.706
horsepower[T.103]    -4.0000      7.318     -0.547      0.598     -20.555      12.555
horsepower[T.112]    -1.0000      7.318     -0.137      0.894     -17.555      15.555
horsepower[T.116]    -9.0000      7.318     -1.230      0.250     -25.555       7.555
horsepower[T.117]     6.0000      7.318      0.820      0.433     -10.555      22.555
horsepower[T.118]     2.0000      7.318      0.273      0.791     -14.555      18.555
horsepower[T.120]    -4.0000      6.338     -0.631      0.544     -18.337      10.337

etc.
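Each horsepower[T.…] row in that table is one indicator column for a distinct string value, with the first level absorbed into the intercept; that is why Df Model is 40 rather than 1. A rough pandas-only illustration of the counting, using pd.get_dummies as a stand-in for the formula machinery (an assumption for illustration, not how statsmodels builds the design matrix internally):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
hp = pd.Series(rng.integers(100, 200, 50), name="horsepower").astype(str)

# Treatment-style coding: one indicator column per distinct level,
# minus the reference level that is dropped.
dummies = pd.get_dummies(hp, drop_first=True)
n_levels = hp.nunique()

print("distinct values:", n_levels)
print("indicator columns:", dummies.shape[1])
```

With 50 rows and almost as many distinct values, nearly every observation gets its own coefficient, which is consistent with the very small Df Residuals in the summary.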

Now, convert the strings back to integers:

df['horsepower'] = pd.to_numeric(df.horsepower)
# or df['horsepower'] = df['horsepower'].astype('int')

results = smf.ols(formula, df).fit()
print(results.summary())

Output (as expected):

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                    mpg   R-squared:                       0.011
Model:                            OLS   Adj. R-squared:                 -0.010
Method:                 Least Squares   F-statistic:                    0.5388
Date:                Sun, 18 Sep 2022   Prob (F-statistic):              0.466
Time:                        20:24:54   Log-Likelihood:                -147.65
No. Observations:                  50   AIC:                             299.3
Df Residuals:                      48   BIC:                             303.1
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     31.7638      3.663      8.671      0.000      24.398      39.129
horsepower    -0.0176      0.024     -0.734      0.466      -0.066       0.031
==============================================================================
Omnibus:                        3.529   Durbin-Watson:                   1.859
Prob(Omnibus):                  0.171   Jarque-Bera (JB):                1.725
Skew:                           0.068   Prob(JB):                        0.422
Kurtosis:                       2.100   Cond. No.                         834.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
