I have a data frame with two columns that contain many "NULL" values:
# 'data.frame': 31337 obs. of 16 variables:
# $ ID : int 1 2 3 5 6 7 8 9 10 11 ...
# $ Target : int 0 0 0 0 0 0 0 0 0 0 ...
# $ band : chr "3. 35 to 44" "NULL" "NULL" "NULL" ...
# $ gender : chr "Male" "NULL" "Male" "NULL" ...
a) Should I remove the rows with "NULL" in R, or b) should I leave "NULL" as a separate category for logistic regression in R?
If the answer to (a) is yes, how do I do it?
There are several things going on here with your question.
First, the character string "NULL" is not the same thing as the R object NULL. E.g.,
is.null(NULL)
[1] TRUE
is.null("NULL")
[1] FALSE
Second, there is a difference between NULL and NA. NULL represents a null or empty object; it is often returned by expressions and functions whose value is undefined. NA marks a missing value (the value is not available). Based on your context, I would replace your "NULL" strings with NA. For a quick way to do that, see dplyr::na_if() (run ?dplyr::na_if for the function's documentation).
Third, if you use glm() to carry out your logistic regression, there are several ways glm() can handle missing data (NAs). You control this with the na.action argument. Run ?glm in the console to pull up the help page for this function; it describes the argument and the values it accepts.
To answer your question about removing NAs versus using a dummy indicator for missing values: that is a matter of model intent. It is difficult to give a general answer to such a broad topic without more details.
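To make the two options concrete, here is a minimal sketch on made-up data. The values, the seed, and the "Missing" label are assumptions for illustration only, not taken from the question's data set. It uses base R so it runs without extra packages; the comment shows where dplyr::na_if() would slot in.

```r
# Toy data with the same column names as in the question (values are invented).
set.seed(1)
n <- 200
df <- data.frame(
  Target = rbinom(n, 1, 0.3),
  band   = sample(c("2. 25 to 34", "3. 35 to 44", "NULL"), n, replace = TRUE),
  gender = sample(c("Male", "Female", "NULL"), n, replace = TRUE),
  stringsAsFactors = FALSE
)

# Turn the literal string "NULL" into NA in every character column.
# Per-column dplyr equivalent: df$band <- dplyr::na_if(df$band, "NULL")
chr_cols <- vapply(df, is.character, logical(1))
df[chr_cols] <- lapply(df[chr_cols], function(x) replace(x, x == "NULL", NA))

# Option (a): drop incomplete rows. na.action = na.omit makes glm() discard
# any row that has an NA in a variable used by the formula.
fit_a <- glm(Target ~ band + gender, data = df,
             family = binomial, na.action = na.omit)

# Option (b): keep every row and treat "missing" as its own category.
df_b <- df
df_b$band[is.na(df_b$band)]     <- "Missing"  # hypothetical label
df_b$gender[is.na(df_b$gender)] <- "Missing"  # hypothetical label
fit_b <- glm(Target ~ band + gender, data = df_b, family = binomial)

# Compare how many observations each model actually used.
nobs(fit_a)
nobs(fit_b)
```

Option (a) fits on fewer rows, which matters if missingness is related to the outcome; option (b) keeps all rows but adds a "Missing" coefficient per column that you then have to interpret.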
@jordan .. Fantastic advice .. dataframe shrunk to 14% of size
data <- na_if(data, "NULL")
data <- data[!is.na(data$age_band) & !is.na(data$gender), ]