I would like to calculate the percentage of NA values in a data frame, both overall and for individual variables.
For the whole data frame I get this:
mean(is.na(dataframe))
# 0.03354
How do I read this result? Are 0.033% of the values NA? I don't understand.
For the individual variables, I did the following to count the NAs:
sapply(DATAFRAME, function(x) sum(is.na(x)))
Then, for the percentage of NA values:
colMeans(is.na(VARIABLEX))
This doesn't work; I get the following error:
"'x' must be an array of at least two dimensions"
Why does this error occur? Anyway, afterwards I tried the following:
mean(is.na(VariableX))
# 0.1188
Should I interpret this as having 0.11% NA values?
I'd just divide the number of rows containing NAs by the total number of rows:
df <- data.frame(data = c(NA, NA, NA, NA, 2, 4, NA, 7, NA))
percent_NA <- NROW(df[is.na(df$data),])/NROW(df)
Which gives:
> percent_NA
[1] 0.6666667
This means that 66.67% of the values in my data frame are NA.
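Note that mean(is.na(...)) returns a proportion between 0 and 1, so it has to be multiplied by 100 to be read as a percentage. A minimal sketch using the same df as above (the names prop_NA and pct_NA are only illustrative):

```r
df <- data.frame(data = c(NA, NA, NA, NA, 2, 4, NA, 7, NA))

# is.na() gives TRUE/FALSE; mean() treats TRUE as 1 and FALSE as 0,
# so the result is the proportion of NA values (6 out of 9 here)
prop_NA <- mean(is.na(df$data))

# multiply by 100 to express the proportion as a percentage
pct_NA <- round(100 * prop_NA, 2)
pct_NA
# [1] 66.67
```

By the same reasoning, the 0.03354 in the question would be about 3.35% and the 0.1188 about 11.88%, not 0.033% and 0.11%.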
I don't understand the issue you are trying to solve. It all works as expected.
First, a dataset since you haven't provided one.
set.seed(6180) # make it reproducible
dat <- data.frame(x = sample(c(1:4, NA), 100, TRUE),
                  y = sample(c(1:5, NA), 100, TRUE))
Now the code for sums.
s <- sapply(dat, function(x) sum(is.na(x)))
s
# x y
#18 13
sum(s)
#[1] 31
sum(is.na(dat))
#[1] 31
colSums(is.na(dat))
# x y
#18 13
The same goes for means, be it mean or colMeans.
EDIT.
Here is the code to get the means of NA values per column/variable and a grand total.
sapply(dat, function(x) mean(is.na(x)))
# x y
#0.18 0.13
colMeans(is.na(dat)) # Same result, faster
# x y
#0.18 0.13
mean(is.na(dat)) # overall mean
#[1] 0.155
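As for the error in the question, it most likely comes from passing a single column to colMeans: is.na() on a plain vector returns a vector with no dimensions, while colMeans expects a matrix or data frame. The sketch below, which assumes the dat defined above, shows two ways around that and how to turn the proportions into percentages (pct_by_col and pct_total are illustrative names):

```r
set.seed(6180)
dat <- data.frame(x = sample(c(1:4, NA), 100, TRUE),
                  y = sample(c(1:5, NA), 100, TRUE))

# is.na(dat$x) is a plain logical vector with no dim attribute,
# which is why colMeans() rejects it
is.null(dim(is.na(dat$x)))
# [1] TRUE

# Option 1: for a single variable, use mean() directly
mean(is.na(dat$x))

# Option 2: keep the column as a one-column data frame,
# so is.na() returns a matrix that colMeans() accepts
colMeans(is.na(dat["x"]))

# Percentages of NA per column and overall
pct_by_col <- 100 * colMeans(is.na(dat))
pct_total  <- 100 * mean(is.na(dat))
```

Both options give the same number for a single variable; colMeans is mainly convenient when several columns are summarised at once.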