"NULL" values in a data frame: should they be removed for logistic regression, and how?

Question

I have a data frame with two columns that contain a large number of "NULL" values:

  'data.frame': 31337 obs. of  16 variables:
  # $ ID                       : int  1 2 3 5 6 7 8 9 10 11 ...
  # $ Target                   : int  0 0 0 0 0 0 0 0 0 0 ...
  # $ band                     : chr  "3. 35 to 44" "NULL" "NULL" "NULL" ...
  # $ gender                   : chr  "Male" "NULL" "Male" "NULL" ...

a) Do I remove the rows with "NULL" in R, or b) do I leave "NULL" as a separate category for logistic regression in R?

If the answer to (a) is yes, how do I do it?

Answer 1

There are several things going on in your question.

"NULL" in your data frame is a character value (the string "NULL"). It is not R's NULL object.

For example:

  > is.null(NULL)
  [1] TRUE
  > is.null("NULL")
  [1] FALSE

In R there is a difference between NULL and NA. NULL represents the null (empty) object; it is often returned by expressions and functions whose value is undefined. NA marks a missing value. Based on your context, I would replace your "NULL" strings with NA. For a quick way to replace "NULL" with NA, see dplyr::na_if().

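As a rough illustration of that replacement, here is a minimal base R sketch. The data frame `df` and its values are made up for the example (only the column names follow the question), and it assumes the columns contain no NA values before the conversion.

```r
# Toy data standing in for the asker's data frame (values are invented).
df <- data.frame(
  ID     = 1:6,
  Target = c(0, 1, 0, 1, 0, 1),
  band   = c("3. 35 to 44", "NULL", "NULL", "2. 25 to 34", "NULL", "3. 35 to 44"),
  gender = c("Male", "NULL", "Male", "NULL", "Female", "Female"),
  stringsAsFactors = FALSE
)

# df == "NULL" is a logical matrix marking every cell that holds the
# string "NULL"; assigning NA through it turns those cells into real
# missing values.
df[df == "NULL"] <- NA

# Count the missing values per column after the conversion.
colSums(is.na(df))
```

dplyr::na_if() is an alternative to the matrix assignment; depending on the dplyr version, it may need to be applied column by column rather than to the whole data frame.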
If you are using glm() to fit your logistic regression model, you control how it handles missing data (NAs) with the na.action argument. The default is taken from options("na.action"), which is na.omit in a fresh R session, so rows with an NA in any model variable are dropped before fitting. Run ?glm in the console to pull up the help page for this function, which describes the argument.
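To make the effect of na.action concrete, the sketch below fits the same model with na.omit and na.exclude on simulated data. The data frame `d`, the seed, and the number of missing values are assumptions chosen purely for illustration.

```r
set.seed(1)
n <- 200

# Simulated data: a binary target and one categorical predictor.
d <- data.frame(
  Target = rbinom(n, 1, 0.4),
  gender = sample(c("Male", "Female"), n, replace = TRUE),
  stringsAsFactors = FALSE
)
d$gender[1:50] <- NA  # pretend the first 50 values are missing

# na.omit drops incomplete rows before fitting.
fit_omit <- glm(Target ~ gender, data = d, family = binomial,
                na.action = na.omit)

# na.exclude also drops them for fitting, but pads residuals and
# fitted values with NA so they line up with the original rows.
fit_exclude <- glm(Target ~ gender, data = d, family = binomial,
                   na.action = na.exclude)

nobs(fit_omit)                  # rows actually used in the fit
length(residuals(fit_omit))     # one residual per used row
length(residuals(fit_exclude))  # one entry per original row

# na.fail refuses to fit when any model variable contains NA.
fit_fail <- try(
  glm(Target ~ gender, data = d, family = binomial, na.action = na.fail),
  silent = TRUE
)
```

Both fits use the same rows, so the coefficients agree; the difference only shows up in the length of residuals and fitted values.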

Whether to remove the NAs or to keep them as a separate category (a dummy indicator for missing values) is a matter of model intent. It is difficult to give a general answer to such a broad question without more details.
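The two options can be set side by side in a small sketch. Everything here is hypothetical: the simulated data frame `d`, the level name "Missing", and the single-predictor model are placeholders, not the asker's actual setup, and the sketch does not say which option is better.

```r
set.seed(42)
n <- 300

# Simulated data in which roughly a third of gender values are missing.
d <- data.frame(
  Target = rbinom(n, 1, 0.3),
  gender = sample(c("Male", "Female", NA), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Option (a): drop the rows where gender is missing.
d_drop <- d[!is.na(d$gender), ]
fit_drop <- glm(Target ~ gender, data = d_drop, family = binomial)

# Option (b): keep every row and treat missingness as its own level.
d_keep <- d
d_keep$gender[is.na(d_keep$gender)] <- "Missing"
d_keep$gender <- factor(d_keep$gender)
fit_keep <- glm(Target ~ gender, data = d_keep, family = binomial)

nrow(d_drop) / nrow(d)  # share of rows that survive option (a)
coef(fit_keep)          # includes a coefficient for the "Missing" level
```

Option (a) costs rows, while option (b) keeps them at the price of an extra coefficient whose interpretation depends on why the values are missing.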

Answer 2

@jordan: fantastic advice. The data frame shrank to 14% of its size:

  library(dplyr)  # provides na_if()
  data <- na_if(data, "NULL")
  data <- data[!is.na(data$age_band) & !is.na(data$gender), ]


solution1
2 2018-05-25 03:23:14

solution2
0 2018-05-25 10:07:46
