
Why doesn't summary(a_categorical_var) show "NA" counts?

I want to examine my dataset, flights, using the summary() function.

summary(flights["tailnum"])

Results:

   tailnum         
 Length:336776     
 Class :character  
 Mode  :character  

In particular, it does not show that the character variable tailnum has any NAs.

However, sum(is.na(flights$tailnum)) shows that it does contain NAs:

[1] 2512

What is the best function to examine a categorical variable, one that shows its levels, missing values, total number of rows, and the frequency of each level?

Apparently the summary() method for character vectors doesn't report NAs; it only gives the length, class, and mode. (This does seem a bit inconsistent, and might be worth reporting or discussing on the r-devel@r-project.org mailing list.)

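If you convert the variable to a factor and apply summary() to it directly, you'll get a table of counts. Because tailnum has more levels than summary() will print by default (maxsum = 100), only the 98 most frequent levels are shown, followed by an "(Other)" category and the number of NAs.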

summary(factor(flights$tailnum))
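To see the difference without loading the flights data, here is a minimal sketch on a toy vector. The values in x are made up for illustration and are not taken from the real tailnum column.

```r
# Toy stand-in for flights$tailnum (hypothetical values, not the real data)
x <- c("N1", "N2", "N1", NA, "N3", NA)

# Character vector: only Length / Class / Mode are reported, no NA count
print(summary(x))

# Factor: per-level counts, plus an "NA's" entry when NAs are present
s <- summary(factor(x))
print(s)
```

With only three levels there is no "(Other)" entry; it appears only when the number of levels exceeds the maxsum limit.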

If you really want a full tabulation:

tt <- table(flights$tailnum, useNA = "ifany")
print(tt)

Printing the full table is not very readable, though. length(tt) is 4044, telling you that there are 4043 distinct non-NA values (plus the NA entry). head(table(tt)) and tail(table(tt)) tabulate the counts themselves, and tell you that there are hundreds of values that occur only a few times, and a few values that occur hundreds (or thousands) of times.
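As far as I know, base R has no single function that reports all of this at once, so one option is a small helper. The following is only a sketch: describe_categorical and its top argument are hypothetical names made up for this example, and the toy vector again stands in for flights$tailnum.

```r
# Hypothetical helper (not part of base R or any package): collects the row
# count, NA count, number of distinct non-NA values, and the most frequent values.
describe_categorical <- function(x, top = 10) {
  tt <- table(x, useNA = "no")   # counts of non-NA values only
  list(
    n_rows     = length(x),
    n_missing  = sum(is.na(x)),
    n_levels   = length(tt),
    top_counts = head(sort(tt, decreasing = TRUE), top)
  )
}

# Toy data (made-up values)
x <- c("N1", "N2", "N1", NA, "N3", NA)
res <- describe_categorical(x)
str(res)
```

On the real data this would be called as describe_categorical(flights$tailnum). Keeping the NA count separate from the frequency table avoids mixing missing values in with the real levels.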

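If you're using the tidyverse (across() needs dplyr 1.0.0 or later) and want to convert all character variables to factors, the following returns a converted copy; assign the result if you want to keep it: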

flights %>% mutate(across(where(is.character), factor))
