简体   繁体   中英

Removing “NA” values in R (for some reason na.omit and complete.cases only work on one variable…)

The first part of this assignment is to import data from an external http site; the data contains eight variables with 1,339 observations. Four of those variables (age, height, weight, igf1) contain NA values within them (NOTE: the other variables may also have NA values, but I'm not concerned with them). I need to eliminate the NA values in those four variables: this is where I'm struggling.

Here is what I have so far:

#imports dataset from internet
importData <- read.table("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv", sep=',', header=T)
#inspects the data:
str(importData)

Basically, I want to remove ALL NA values in age, height, weight, and igf1. I'll know I'm successful when I have 858 observations remaining.

Three of the variables (height, weight, igf1) contain FACTOR type information. One of the variables (age) contains numeric information. I have been unable to successfully implement complete.cases and/or na.omit across them: those functions only seem to work on $age, where they eliminate the five NA values (but don't touch the other variables)

I need help cutting out the NA values in the remaining variables. Again, when I'm done, I should have 858 observations.

The problem is that your data file encloses all values in quotes, like this:

"age", "height", "menarche", "sex", "igf1", "tanner", "testvol", "weight"
"NA", "NA", "NA", "NA", "90.0", "NA", "NA", "NA"
"NA", "NA", "NA", "NA", "88.0", "NA", "NA", "NA"

When R reads that, it takes all values as strings, which by default represent as factors. The NA s are just taken as the label for one level of those factors. You can see this with str :

> str(importData)
'data.frame':   1339 obs. of  8 variables:
 $ age     : num  NA NA NA NA NA 0.17 0.17 0.17 0.17 0.17 ...
 $ height  : Factor w/ 600 levels " 110.8"," 111.5",..: 600 600 600 600 600 600 600 600 600 600 ...
 $ menarche: Factor w/ 3 levels " 1"," 2"," NA": 3 3 3 3 3 3 3 3 3 3 ...
 $ sex     : Factor w/ 3 levels " 1.00"," 2.00",..: 3 3 3 3 3 1 1 1 1 1 ...
 $ igf1    : Factor w/ 501 levels " 100.0"," 101.0",..: 490 487 53 55 23 2 498 6 10 474 ...
 $ tanner  : Factor w/ 6 levels " 1"," 2"," 3",..: 6 6 6 6 6 1 1 1 1 1 ...
 $ testvol : Factor w/ 26 levels " 1"," 10"," 11",..: 26 26 26 26 26 26 26 26 26 26 ...
 $ weight  : Factor w/ 518 levels " 14.1"," 17.9",..: 518 518 518 518 518 518 518 518 518 518 ...

So read your data without making factors, which reads all as numeric:

> importData <- read.csv("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv",
                         stringsAsFactors=F, na.strings=c(NA,"NA"," NA"))
> str(importData)
'data.frame':   1339 obs. of  8 variables:
 $ age     : num  NA NA NA NA NA 0.17 0.17 0.17 0.17 0.17 ...
 $ height  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ menarche: int  NA NA NA NA NA NA NA NA NA NA ...
 $ sex     : num  NA NA NA NA NA 1 1 1 1 1 ...
 $ igf1    : num  90 88 164 166 131 101 97 106 111 79 ...
 $ tanner  : int  NA NA NA NA NA 1 1 1 1 1 ...
 $ testvol : int  NA NA NA NA NA NA NA NA NA NA ...
 $ weight  : num  NA NA NA NA NA NA NA NA NA NA ...

and now remove NA s:

> data <- importData[complete.cases(importData[c("age","height","weight","igf1")]),]
> str(data)
'data.frame':   858 obs. of  8 variables:
 $ age     : num  6 6.08 6.26 6.4 6.42 6.43 6.61 6.63 6.7 6.72 ...
 $ height  : num  112 117 120 116 116 ...
 $ menarche: int  NA NA NA NA NA NA NA NA NA NA ...
 $ sex     : num  1 1 1 1 1 1 1 1 1 1 ...
 $ igf1    : num  98 242 196 179 126 142 236 148 174 136 ...
 $ tanner  : int  1 1 1 1 1 1 1 1 1 1 ...
 $ testvol : int  1 1 1 1 1 1 1 2 1 1 ...
 $ weight  : num  19.1 21.7 24.7 19.6 20.6 20.2 28 21.6 26.1 22.6 ...

Why not make things a lot less stressful and read the .txt version of the data instead? I found it in the data list on the same page as the .csv . It isn't spaced abnormally like the .csv file and there is no fiddling around with read.csv arguments (which I did for quite a while before finding the other file).

con <- "http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.txt"
dat <- read.table(con, header = TRUE)
sapply(dat, class)
#       age    height  menarche       sex      igf1    tanner   testvol    weight 
# "numeric" "numeric" "integer" "numeric" "numeric" "integer" "integer" "numeric" 
cols <- c("age", "height", "weight", "igf1")
resultSet <- dat[complete.cases(dat[cols]), ]
dim(resultSet)
# [1] 858   8

Read in your data using as.is=TRUE and na.strings = c(NA, "NA", " NA")) . Your file has spaces between fields which are interpreted as part of the field.

data <- read.table("http://people.sc.fsu.edu/~jburkardt/datasets/iswr/juul2.csv", 
        sep=",", header=TRUE, as.is=TRUE, na.strings=c(NA, "NA", " NA"))

If you want to remove rows where any of the selected columns are NA use complete.cases :

new.data <- data[complete.cases(data[ ,c("age", "height", "weight", "igf1")]), ]
nrow(new.data)
# [1] 858

If you want to remove only the rows where all of the selected columns are NA , use rowSums :

new.data <- data[rowSums(is.na(data[ ,c("age", "height", "weight", "igf1")])) < 4, ]
nrow(new.data)
# [1] 1339   # there aren't any such rows :)

This isolated the four columns, tests whether they are NA , and for each row calculates how many are NA , returning only the ones that have fewer than 4 NA s in each row.

If you are reading data from file, use the read.table option bellow:

read.table(...,quote="")

If you conditionally created NAs, like:

df[condition] <- NA

make sure you never use it quoted ("NA").

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM