
Nulls in a data frame: should they be removed for logistic regression models, and if so, how?

I have a data frame in which two columns contain many "NULL" values:

  'data.frame': 31337 obs. of  16 variables:
  # $ ID                       : int  1 2 3 5 6 7 8 9 10 11 ...
  # $ Target                   : int  0 0 0 0 0 0 0 0 0 0 ...
  # $ band                     : chr  "3. 35 to 44" "NULL" "NULL" "NULL" ...
  # $ gender                   : chr  "Male" "NULL" "Male" "NULL" ...

Should I (a) remove the rows containing "NULL", or (b) keep "NULL" as a separate category for logistic regression in R?

If the answer is (a), how do I do it?

There are several things going on in your question.

  • "NULL" in your data frame is a character string. It is not R's NULL object.

For example:

is.null(NULL)
[1] TRUE
is.null("NULL")
[1] FALSE
  • In R there is a difference between NULL and NA. NULL represents the null (empty) object, and it is often returned by functions whose value is undefined. NA is a missing value: the observation exists, but its value is unknown. Based on your context, I would replace your "NULL" values with NA. For a quick way to replace "NULL" with NA, see dplyr::na_if().
  • If you are using glm() to fit your logistic regression model, there are several ways it can handle missing data (NAs). You control this with the na.action argument. Run ?glm in the console to pull up the help page for this function, which describes each value the argument can take.
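The two points above can be sketched as follows. This is a minimal illustration, not the asker's actual data: the toy data frame and its values are made up, and the replacement step uses base R instead of dplyr::na_if() so that it runs without extra packages (in recent dplyr releases, na_if() is meant to be applied to one column at a time, not to a whole data frame).

```r
# Toy data standing in for the real data frame; all values here are made up.
data <- data.frame(
  Target   = c(0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0),
  age_band = c("3. 35 to 44", "NULL", "2. 25 to 34", "NULL",
               "3. 35 to 44", "2. 25 to 34", "2. 25 to 34", "3. 35 to 44",
               "2. 25 to 34", "3. 35 to 44", "2. 25 to 34", "3. 35 to 44"),
  gender   = c("Male", "NULL", "Male", "Female", "Female", "NULL",
               "Female", "Male", "Female", "Male", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Replace the literal string "NULL" with a real missing value (base R).
data[data == "NULL"] <- NA
colSums(is.na(data))  # number of NAs per column

# na.omit drops incomplete rows before fitting.
fit_omit <- glm(Target ~ age_band + gender, data = data,
                family = binomial, na.action = na.omit)

# na.exclude fits on the same rows, but pads residuals and fitted values
# with NA so they line up with the rows of the original data frame.
fit_excl <- glm(Target ~ age_band + gender, data = data,
                family = binomial, na.action = na.exclude)

nobs(fit_omit)               # rows actually used in the fit
length(residuals(fit_omit))  # one residual per row used
length(residuals(fit_excl))  # one residual per original row
```

Both fits use the same complete rows; the difference only shows up when residuals or predictions are mapped back onto the original data.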

As for whether to remove the NAs or to use a dummy indicator for missing values, that depends on what the model is for. It is difficult to give a general answer to such a broad question without more details.
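To make the trade-off concrete, here is a small sketch of both options side by side. The data are invented for illustration, only one predictor is used, and the level name "Missing" is an arbitrary choice, not something prescribed by glm().

```r
# Made-up example data with the literal string "NULL" in one predictor.
data <- data.frame(
  Target = c(0, 1, 0, 1, 1, 0, 1, 0, 0, 1),
  gender = c("Male", "NULL", "Male", "Female", "NULL",
             "Female", "Male", "Female", "NULL", "Male"),
  stringsAsFactors = FALSE
)

# Option (a): drop the rows where gender is "NULL".
dropped <- data[data$gender != "NULL", ]

# Option (b): keep every row and treat "NULL" as its own factor level.
kept <- data
kept$gender[kept$gender == "NULL"] <- "Missing"
kept$gender <- factor(kept$gender)

fit_drop <- glm(Target ~ gender, data = dropped, family = binomial)
fit_keep <- glm(Target ~ gender, data = kept, family = binomial)

nrow(dropped) / nrow(data)  # share of rows that survive option (a)
names(coef(fit_keep))       # includes a coefficient for the "Missing" level
```

Option (a) gives a simpler model but can discard a large share of the data. Option (b) keeps every row and lets the model estimate whether missingness itself is associated with the target.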

@jordan Fantastic advice. The data frame shrank to 14% of its original size:

data <- na_if(data, "NULL")
data <- data[!is.na(data$age_band) & !is.na(data$gender), ]
