My dependent variable is a dummy variable (1 = disclosed, 0 = not disclosed) and my independent variable is categorical (five types of sectors).
Can a linear regression model be used with these data?
My objective is to identify which sectors do or do not disclose.
Is the following a sensible approach?
summary(lm(Disclosed ~ 0 + Sectors, data = df_0))
I added "0 +" to the model to remove the intercept, so that the first sector is also returned. Without it, the first sector is missing from the output and I don't know why. I am very lost. Thanks!
If I use a binomial logistic regression instead, I don't know how to interpret the significance values and the signs of the estimates that I obtain.
Call:
glm(formula = Disclosed ~ 0 + Sectors, family = binomial(link = "logit"),
data = df_0)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.96954  -0.32029  -0.00005  -0.00005   2.48638
Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
SectorsCOMMUNICATION     -0.5108     0.5164  -0.989  0.32256
SectorsCONSIMERSTAPLES  -20.5661  6268.6324  -0.003  0.99738
SectorsCONSUMERDISCRET   -3.0445     1.0235  -2.975  0.00293 **
SectorsENERGY           -20.5661  3780.1276  -0.005  0.99566
SectorsFINANCIALS        -2.9444     0.7255  -4.059 4.94e-05 ***
SectorsHEALTHCARE       -20.5661  5345.9077  -0.004  0.99693
SectorsINDUSTRIALS      -20.5661  2803.4176  -0.007  0.99415
SectorsINDUSTRIALS      -20.5661 17730.3699  -0.001  0.99907
SectorsINFORMATION       -1.0986     0.8165  -1.346  0.17846
SectorsMATERIALS        -20.5661  3780.1276  -0.005  0.99566
SectorsREALESTATE       -20.5661  8865.1850  -0.002  0.99815
SectorsUTILITIES        -20.5661  7238.3932  -0.003  0.99773
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 277.259 on 200 degrees of freedom
Residual deviance: 54.185 on 188 degrees of freedom
AIC: 78.185
Number of Fisher Scoring iterations: 19
This means that the financial and consumer discretionary sectors are the least disclosed, right?
On the other hand, if I apply lm, the results look more consistent. The sectors that disclose the most are information and communication: their estimates are positive and significant.
Call:
lm(formula = Disclosed ~ 0 + Sectors, data = df_0)
Residuals:
    Min      1Q  Median      3Q     Max
-0.3750 -0.0500  0.0000  0.0000  0.9546
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
SectorsCOMMUNICATION   3.750e-01  5.191e-02   7.224 1.22e-11 ***
SectorsCONSIMERSTAPLES 0.000e+00  7.341e-02   0.000 1.000000
SectorsCONSUMERDISCRET 4.545e-02  4.427e-02   1.027 0.305815
SectorsENERGY          0.000e+00  4.427e-02   0.000 1.000000
SectorsFINANCIALS      5.000e-02  3.283e-02   1.523 0.129426
SectorsHEALTHCARE      0.000e+00  6.260e-02   0.000 1.000000
SectorsINDUSTRIALS     2.194e-18  3.283e-02   0.000 1.000000
SectorsINDUSTRIALS     0.000e+00  2.076e-01   0.000 1.000000
SectorsINFORMATION     2.500e-01  7.341e-02   3.406 0.000807 ***
SectorsMATERIALS       0.000e+00  4.427e-02   0.000 1.000000
SectorsREALESTATE      0.000e+00  1.038e-01   0.000 1.000000
SectorsUTILITIES       1.416e-17  8.476e-02   0.000 1.000000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2076 on 188 degrees of freedom
Multiple R-squared: 0.2632, Adjusted R-squared: 0.2162
F-statistic: 5.597 on 12 and 188 DF, p-value: 3.568e-08
Logistic regression is the better choice for this particular problem, because the dependent variable is binary.
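As a rough check on how to read the logit output in the question: in a no-intercept model with a single categorical predictor, each estimate is the log-odds of disclosure within that sector, so the inverse logit turns it back into a proportion. The sketch below only re-uses four estimates copied from the glm output above; the vector name logit_est is made up for illustration.

```r
# Logit estimates copied from the glm output in the question
# (logit_est is an illustrative name, not part of the original code)
logit_est <- c(COMMUNICATION   = -0.5108,
               CONSUMERDISCRET = -3.0445,
               FINANCIALS      = -2.9444,
               INFORMATION     = -1.0986)

# plogis() is the inverse logit: it maps log-odds to probabilities
prob <- plogis(logit_est)
round(prob, 4)
```

These come out at about 0.375, 0.045, 0.05 and 0.25, the same proportions that the lm fit reports. So the negative sign and small p-value for FINANCIALS only say that its disclosure rate (about 5%) is below 50%, not that it is the least-disclosing sector. The sectors with an estimate of -20.57 and a huge standard error appear to be the ones with no disclosing firms at all (the lm fit gives them 0): the fitted probability is essentially zero and the Wald z test is not informative there.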
Regarding the linear regression output: for a categorical independent variable, lm takes the first level of the factor (the first category in alphabetical order by default) as the base class, which is absorbed into the intercept, and reports the other classes relative to it. That is why the first sector does not appear unless the intercept is removed. In the example below, category A becomes the intercept and the other coefficients are differences from class A.
For example,
set.seed(100)
a <- sample(c(1,0), 100, replace = TRUE)
b <- sample(c('A', 'B', 'C', 'D', 'E'), 100, replace = TRUE)
lm(a ~ b)
Call:
lm(formula = a ~ b)
Coefficients:
(Intercept)           bB           bC           bD           bE
   0.562500    -0.183190     0.104167    -0.107955    -0.006944
This is equivalent to the model without an intercept, where each coefficient is the group mean directly:
Call:
lm(formula = a ~ 0 + b)
Coefficients:
    bA      bB      bC      bD      bE
0.5625  0.3793  0.6667  0.4545  0.5556
c <- broom::tidy(lm(a ~ 0 + b))
c$estimate
[1] 0.5625000 0.3793103 0.6666667 0.4545455 0.5555556
d <- broom::tidy(lm(a ~ b))
d$estimate
[1] 0.562500000 -0.183189655 0.104166667 -0.107954545 -0.006944444
d$estimate[2:5] + d$estimate[1]
[1] 0.3793103 0.6666667 0.4545455 0.5555556
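The same kind of comparison can be sketched for the logistic model, using the same simulated setup. Assuming every category contains both 0s and 1s, the inverse logit of the no-intercept glm coefficients should reproduce the group means that lm(a ~ 0 + b) returns. The names fit_logit, p_logit and p_mean are illustrative only.

```r
# Same simulated data as in the example above
set.seed(100)
a <- sample(c(1, 0), 100, replace = TRUE)
b <- sample(c("A", "B", "C", "D", "E"), 100, replace = TRUE)

# Logistic regression without an intercept: one log-odds per category
fit_logit <- glm(a ~ 0 + b, family = binomial(link = "logit"))

# Convert log-odds to probabilities and compare with the raw group means
p_logit <- plogis(coef(fit_logit))
p_mean  <- tapply(a, b, mean)
cbind(logit = p_logit, mean = p_mean)
```

If the two columns agree, the logistic and linear fits describe the same per-category proportions; they differ in the scale of the coefficients and in what the reported p-values test (log-odds equal to 0, i.e. a proportion of 50%, versus a mean equal to 0).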