
Model Analysis in R (Logistic Regression)

I have a data file (1 million rows) with one outcome variable, status (yes/no), three continuous variables, and five nominal variables (five categories each). I want to predict the outcome, i.e. status, and would like to know which type of analysis is suitable for building the model. I have seen logit, probit, and logistic regression mentioned. I am unsure where to start and how to identify the variables most likely to be useful.

Data file:

gender,region,age,company,speciality,jobrole,diag,labs,orders,status
M,west,41,PA,FPC, Assistant,code18,27,3,yes
M,Southwest,65,CV,FPC,Worker,code18,69,11,no
M,South,27,DV,IMC,Assistant,invalid,62,13,no
M,Southwest,18,CV,IMC,Worker,code8,6,1,yes

PS: I am using R. Any help would be greatly appreciated. Thanks!

Of the three, most people start their analysis with logistic regression.

Note that logistic regression and the logit model are the same thing: "logit" is the name of the link function that logistic regression uses.
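To make the logistic/logit relationship concrete, here is a small sketch using base R's qlogis() and plogis(). The probabilities in p are arbitrary illustration values, not taken from the question's data.

```r
# Arbitrary probabilities, chosen only for illustration
p <- c(0.1, 0.5, 0.9)

# logit: maps a probability to log-odds, log(p / (1 - p))
log_odds <- qlogis(p)

# logistic: the inverse, 1 / (1 + exp(-x)), maps log-odds back to a probability
back <- plogis(log_odds)

print(log_odds)
print(back)

# binomial() uses the logit link unless another link is requested
default_link <- binomial()$link
print(default_link)
```

So glm(..., family = binomial) and glm(..., family = binomial(link = 'logit')) fit the same model.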

When deciding between logistic and probit, go for logistic.

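In practice the two usually give very similar predictions, and any difference in fitting speed is minor. Logistic has the edge for interpretation, because its coefficients are log-odds that exponentiate to odds ratios.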
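A quick way to see how the two links compare is to fit both to the same data. The sketch below uses simulated toy data (the variable names x and y and all coefficients are made up for illustration), so treat it as a rough check rather than a statement about your file.

```r
# Toy data (simulated, not the asker's file): one continuous predictor, binary outcome
set.seed(1)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))
toy <- data.frame(x = x, y = y)

# Same model formula, two link functions
fit_logit  <- glm(y ~ x, data = toy, family = binomial(link = "logit"))
fit_probit <- glm(y ~ x, data = toy, family = binomial(link = "probit"))

# Largest absolute difference between the two sets of fitted probabilities
max_gap <- max(abs(fitted(fit_logit) - fitted(fit_probit)))
print(max_gap)

# Only the logit coefficients exponentiate to odds ratios
print(exp(coef(fit_logit)))
```

If the gap is small on your own data as well, the choice between the two comes down mostly to how you want to interpret the coefficients.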

Now, to settle on variables: you can vary the set of variables used in the model. Start with all of them:

# Fit on all predictors; status must be a factor or 0/1 for glm()
model1 <- glm(status ~ ., data = df, family = binomial(link = 'logit'))

Now check the model summary with summary(model1) to see which predictor variables are significant, then refit with fewer variables:
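As a rough illustration of what to look at in the summary, the sketch below fits a model to simulated toy data (the column values and the effect of age are invented) and pulls out the coefficient table, the odds ratios, and a per-variable likelihood-ratio test via drop1(). The latter is handy for nominal variables, since summary() reports one row per category level rather than one per variable.

```r
# Toy data (simulated): only age actually affects status here
set.seed(42)
n <- 5000
toy <- data.frame(
  age    = round(runif(n, 18, 70)),
  labs   = rpois(n, 30),
  gender = factor(sample(c("M", "F"), n, replace = TRUE))
)
p <- plogis(-2 + 0.05 * toy$age)
toy$status <- factor(ifelse(rbinom(n, 1, p) == 1, "yes", "no"),
                     levels = c("no", "yes"))

fit <- glm(status ~ ., data = toy, family = binomial(link = "logit"))

# Coefficient table: estimate, standard error, z value, p-value
coef_table <- coef(summary(fit))
print(coef_table)

# Odds ratios: multiplicative change in the odds of "yes" per unit increase
odds_ratios <- exp(coef(fit))
print(odds_ratios)

# Likelihood-ratio test for dropping each variable as a whole
lr_tests <- drop1(fit, test = "Chisq")
print(lr_tests)
```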

# Same model with orders left out
model2 <- glm(status ~ gender + region + age + company + speciality + jobrole + diag + labs, data = df, family = binomial(link = 'logit'))

Reducing the number of variables and comparing the fits makes it easier to identify which variables are important.
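One possible way to compare a full and a reduced model is a likelihood-ratio test with anova() plus a look at AIC. This is a sketch on simulated toy data (the names full and reduced and all coefficients are assumptions for illustration); with your data you would pass your own two fitted models instead.

```r
# Toy data (simulated): status depends on age and orders, not on labs
set.seed(7)
n <- 5000
toy <- data.frame(
  age    = runif(n, 18, 70),
  labs   = rpois(n, 30),
  orders = rpois(n, 5)
)
toy$status <- rbinom(n, 1, plogis(-2 + 0.04 * toy$age + 0.1 * toy$orders))

full    <- glm(status ~ age + labs + orders, data = toy, family = binomial)
reduced <- glm(status ~ age + labs,          data = toy, family = binomial)

# Likelihood-ratio test: does leaving out orders make the fit significantly worse?
cmp <- anova(reduced, full, test = "Chisq")
print(cmp)

# AIC: lower is better, with a penalty for extra parameters
aics <- AIC(reduced, full)
print(aics)
```

A small p-value in the anova() table suggests the dropped variable was carrying real information and should stay in the model.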

Also, make sure you have cleaned the data before fitting.
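A minimal cleaning sketch, using the four sample rows from the question as inline text (with the real file you would pass a file path to read.csv() instead). Which columns count as nominal is my reading of the header, so adjust the list as needed.

```r
# The sample rows from the question, inlined so the example is self-contained
raw <- "gender,region,age,company,speciality,jobrole,diag,labs,orders,status
M,west,41,PA,FPC, Assistant,code18,27,3,yes
M,Southwest,65,CV,FPC,Worker,code18,69,11,no
M,South,27,DV,IMC,Assistant,invalid,62,13,no
M,Southwest,18,CV,IMC,Worker,code8,6,1,yes"

# strip.white = TRUE removes stray spaces such as the one in " Assistant"
df <- read.csv(text = raw, strip.white = TRUE, stringsAsFactors = FALSE)

# Outcome as a factor; glm() treats the first level ("no") as failure
df$status <- factor(df$status, levels = c("no", "yes"))

# Nominal predictors as factors (column list assumed from the header)
nominal <- c("gender", "region", "company", "speciality", "jobrole", "diag")
df[nominal] <- lapply(df[nominal], factor)

# Check column types and count missing values per column
str(df)
print(colSums(is.na(df)))
```

It is also worth looking at the factor levels, e.g. levels(df$diag), to decide how to handle entries such as "invalid".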

Avoid including highly correlated variables; you can check the continuous ones using cor().
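For example, a correlation check on the continuous columns might look like the sketch below. The data are simulated, with orders deliberately constructed to track labs, and the 0.7 cutoff is an arbitrary illustration rather than a rule.

```r
# Toy data (simulated): orders is built to be strongly related to labs
set.seed(3)
n <- 1000
labs <- rpois(n, 30)
toy <- data.frame(
  age    = runif(n, 18, 70),
  labs   = labs,
  orders = round(labs / 3 + rnorm(n))
)

# Pairwise Pearson correlations of the continuous variables only
cor_mat <- cor(toy[, c("age", "labs", "orders")])
print(round(cor_mat, 2))

# Pairs whose absolute correlation exceeds the (arbitrary) 0.7 cutoff;
# upper.tri() keeps each pair once and skips the diagonal
high <- which(abs(cor_mat) > 0.7 & upper.tri(cor_mat), arr.ind = TRUE)
print(high)
```

cor() only accepts numeric columns, so factors need to be left out of the subset passed to it.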


 