Linear Regression model with dummy (dependent) variable and categorical (independent) variable in R

My dependent variable is a dummy variable (1 = disclosed, 0 = not disclosed) and my independent variable is categorical (five types of sectors).

With these data, can a linear regression model be used?

My objective is to identify which sectors do or do not disclose.

Is the following a good approach? For example:

summary(lm(Disclosed ~ 0 + Sectors, data = df_0))

I added "0 +" to the formula to remove the intercept, so that the first sector is also returned. Without it, the first sector is missing from the output, and I don't understand why. I am very lost. Thanks!

If I use a binomial logistic regression, I don't know how to interpret the significance values together with the signs of the estimates that I obtain.

Call:
glm(formula = Disclosed ~ 0 + Sectors, family = binomial(link = "logit"), 
    data = df_0)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.96954  -0.32029  -0.00005  -0.00005   2.48638  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
SectorsCOMMUNICATION      -0.5108     0.5164  -0.989  0.32256    
SectorsCONSIMERSTAPLES   -20.5661  6268.6324  -0.003  0.99738    
SectorsCONSUMERDISCRET    -3.0445     1.0235  -2.975  0.00293 ** 
SectorsENERGY            -20.5661  3780.1276  -0.005  0.99566    
SectorsFINANCIALS         -2.9444     0.7255  -4.059 4.94e-05 ***
SectorsHEALTHCARE        -20.5661  5345.9077  -0.004  0.99693    
SectorsINDUSTRIALS       -20.5661  2803.4176  -0.007  0.99415    
SectorsINDUSTRIALS       -20.5661 17730.3699  -0.001  0.99907    
SectorsINFORMATION        -1.0986     0.8165  -1.346  0.17846    
SectorsMATERIALS         -20.5661  3780.1276  -0.005  0.99566    
SectorsREALESTATE        -20.5661  8865.1850  -0.002  0.99815    
SectorsUTILITIES         -20.5661  7238.3932  -0.003  0.99773    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 277.259  on 200  degrees of freedom
Residual deviance:  54.185  on 188  degrees of freedom
AIC: 78.185

Number of Fisher Scoring iterations: 19

This means that the financial and consumer discretionary sectors are the ones that disclose the least, right?

On the other hand, if I apply an lm, it returns more consistent results. The sectors that disclose the most are information and communication: their estimates are positive and significant.

Call:
lm(formula = Disclosed ~ 0 + Sectors, data = df_0)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3750 -0.0500  0.0000  0.0000  0.9546 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
SectorsCOMMUNICATION   3.750e-01  5.191e-02   7.224 1.22e-11 ***
SectorsCONSIMERSTAPLES 0.000e+00  7.341e-02   0.000 1.000000    
SectorsCONSUMERDISCRET 4.545e-02  4.427e-02   1.027 0.305815    
SectorsENERGY          0.000e+00  4.427e-02   0.000 1.000000    
SectorsFINANCIALS      5.000e-02  3.283e-02   1.523 0.129426    
SectorsHEALTHCARE      0.000e+00  6.260e-02   0.000 1.000000    
SectorsINDUSTRIALS     2.194e-18  3.283e-02   0.000 1.000000    
SectorsINDUSTRIALS     0.000e+00  2.076e-01   0.000 1.000000    
SectorsINFORMATION     2.500e-01  7.341e-02   3.406 0.000807 ***
SectorsMATERIALS       0.000e+00  4.427e-02   0.000 1.000000    
SectorsREALESTATE      0.000e+00  1.038e-01   0.000 1.000000    
SectorsUTILITIES       1.416e-17  8.476e-02   0.000 1.000000    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2076 on 188 degrees of freedom
Multiple R-squared:  0.2632,    Adjusted R-squared:  0.2162 
F-statistic: 5.597 on 12 and 188 DF,  p-value: 3.568e-08

It would be better to use logistic regression for this particular problem.
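As a rough illustration of how the logistic output can be read, here is a minimal sketch on made-up data (the data frame `df_demo` and its three sectors are hypothetical, not the asker's `df_0`). Without an intercept, each coefficient is the log-odds of disclosure in that sector, so `plogis()` turns it back into the observed proportion. A sector with no disclosures at all has a log-odds that runs off toward minus infinity, which is likely what the estimates near -20.57 with very large standard errors in the question reflect. Note also that each z test compares the log-odds with 0, that is, a disclosure probability of 0.5, so a small p-value there does not by itself rank the sectors.

```r
# Hypothetical data: three sectors, one of which never discloses.
df_demo <- data.frame(
  Sectors   = rep(c("A", "B", "C"), times = c(20, 20, 10)),
  Disclosed = c(rep(c(1, 0), times = c(8, 12)),   # A: 8 of 20 disclose
                rep(c(1, 0), times = c(2, 18)),   # B: 2 of 20 disclose
                rep(0, 10))                       # C: none disclose
)

# Same model form as in the question: no intercept, one coefficient per sector.
fit <- glm(Disclosed ~ 0 + Sectors,
           family = binomial(link = "logit"),
           data = df_demo)

# Coefficients are log-odds; plogis() maps them back to probabilities.
fitted_prob <- plogis(coef(fit))

# Observed share of disclosing firms per sector, for comparison.
observed <- tapply(df_demo$Disclosed, df_demo$Sectors, mean)

# Side-by-side view: log-odds, implied probability, observed proportion.
round(cbind(log_odds = coef(fit), fitted_prob, observed), 4)
```

In this sketch the implied probabilities for A and B match the observed proportions, and sector C gets a large negative estimate whose implied probability is essentially zero.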

Regarding the linear regression output: for a categorical independent variable, lm takes the first level (alphabetical order by default) as the reference level. Its mean is reported as the intercept, and the coefficients of the other levels are differences from that reference.

In the example below, category A becomes the intercept and the other coefficients are expressed relative to A.

For example,

set.seed(100)

a <- sample(c(1,0), 100, replace = TRUE)
b <- sample(c('A', 'B', 'C', 'D', 'E'), 100, replace = TRUE)

lm(a ~ b)
Call:
lm(formula = a ~ b)

Coefficients:
(Intercept)           bB           bC           bD           bE  
   0.562500    -0.183190     0.104167    -0.107955    -0.006944  

is equivalent to

Call:
lm(formula = a ~ 0 + b)

Coefficients:
    bA      bB      bC      bD      bE  
0.5625  0.3793  0.6667  0.4545  0.5556  

c <- broom::tidy(lm(a ~ 0 + b))
c$estimate
[1] 0.5625000 0.3793103 0.6666667 0.4545455 0.5555556

d <- broom::tidy(lm(a ~ b))
d$estimate
[1]  0.562500000 -0.183189655  0.104166667 -0.107954545 -0.006944444

d$estimate[2:5] + d$estimate[1]
[1] 0.3793103 0.6666667 0.4545455 0.5555556
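To make the equivalence more concrete, the following sketch (reusing the simulated `a` and `b` from above; the name `b_ref_c` is introduced here only for illustration) checks that the no-intercept coefficients are simply the per-category means, and shows one way to pick a different reference category with `relevel()` if the intercept form is preferred. The exact numbers depend on the R version's random number generator, so no output is shown.

```r
set.seed(100)
a <- sample(c(1, 0), 100, replace = TRUE)
b <- sample(c("A", "B", "C", "D", "E"), 100, replace = TRUE)

# Mean of the 0/1 outcome within each category.
group_means <- tapply(a, b, mean)

# Without an intercept, each coefficient is that category's mean.
no_intercept <- coef(lm(a ~ 0 + b))
all.equal(unname(no_intercept), as.vector(group_means))

# With an intercept, the reference level can be changed explicitly.
# Here "C" becomes the baseline instead of the alphabetical default "A".
b_ref_c <- relevel(factor(b), ref = "C")
with_intercept <- coef(lm(a ~ b_ref_c))

# The intercept is now the mean of category C; the rest are differences from C.
all.equal(unname(with_intercept["(Intercept)"]), unname(group_means["C"]))
```

Both `all.equal()` calls should return TRUE, since the two parameterisations describe the same fitted means.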
