
Replacing the mean from psych::describe with the mode in a data frame

I like the summary statistics from psych::describe, but I want to replace the mean with the mode, and only for factor variables. How do I get Mode's output (setosa here) to replace the mean for a factor variable? I use iris as a reproducible example even though it has only one factor.

library(dplyr)  # for %>%, select() and where()

# Return the most frequent non-missing value of a vector
getMode <- function(df) {
  ux <- na.omit(unique(df))
  ux[which.max(tabulate(match(df, ux)))]
}

Mode <- apply(iris %>% select(where(is.factor)), 2, getMode)

# I only want 5 of psych's descriptive stats plus the mode.
table <- cbind(psych::describe(iris),
               Mode)[, c(3, 4, 8, 9, 2, 14)]
table

How can I get mean and mode to combine depending on the structure of the variable?

  1. Is there a way to combine if_else with where to tell R what to do when the condition is FALSE? If I could get the mean returned when the variable is not a factor, I would get a single column that combines means and modes.

psych::describe produces a data frame in which the identifying variable names are row names rather than a selectable column, which makes any manual coding or listing of the variables in mutate() impossible. Factors are also the majority of the variables in my dataset, so a manual approach or a mutate(case_when()) would be REALLY tedious even if it could be done.

PS. I've tried replacing apply() with the map functions, but the output is not compatible with cbind() because it lists the other levels for each factor. If you have a better idea for that part of the code, or think that is where I could combine getMode and mean(), I don't mind suggestions.

If you're willing to use a different function to produce the same kind of output, you could use dplyr and tidyr to accomplish this. With this approach you can do just what you want, using ifelse() to distinguish numeric from non-numeric variables. The only thing to note is that if the function returns non-numeric values for factors, its output for the numeric variables also has to be character. That's why I wrapped the mean() call in sprintf().

getMode <- function(df) {
  ux <- na.omit(unique(df))
  ux[which.max(tabulate(match(df, ux)))]
}

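library(dplyr)  # for %>% and summarise_all()
library(tidyr)  # for pivot_longer()

iris %>%
  summarise_all(.funs = list(
    # mean for numeric columns (as text), mode for everything else
    mean = function(x) ifelse(is.numeric(x), sprintf("%.3f", mean(x)), as.character(getMode(x))),
    # for factors, sd is computed on the underlying integer codes
    sd = function(x) ifelse(is.numeric(x), sd(x), sd(as.numeric(x))),
    # for factors, min and max are the first and last levels
    min = function(x) ifelse(is.numeric(x), sprintf("%.3f", min(x)), levels(x)[1]),
    max = function(x) ifelse(is.numeric(x), sprintf("%.3f", max(x)), levels(x)[length(levels(x))]),
    n = function(x) sum(!is.na(x))
  )) %>%
  # split column names such as Sepal.Length_mean into a variable name and a statistic
  pivot_longer(everything(),
               names_to = c("set", ".value"),
               names_pattern = "(.+)_(.+)")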
                            
# A tibble: 5 x 6
#            set  mean     sd   min    max         n
#          <chr> <chr>  <dbl> <chr>  <chr>     <int>
# 1 Sepal.Length 5.843  0.828 4.300  7.900       150
# 2 Sepal.Width  3.057  0.436 2.000  4.400       150
# 3 Petal.Length 3.758  1.77  1.000  6.900       150
# 4 Petal.Width  1.199  0.762 0.100  2.500       150
# 5 Species      setosa 0.819 setosa virginica   150    
#     
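The mean column in the output above is <chr> rather than <dbl> because a single column can hold only one type. A minimal base R sketch of that coercion follows; the variable names mixed and formatted are just illustrative:

```r
# Combining a number and a string in one vector coerces everything to character
mixed <- c(mean(iris$Sepal.Length), "setosa")
class(mixed)
# [1] "character"

# sprintf() does the conversion explicitly and controls the number of decimals
formatted <- sprintf("%.3f", mean(iris$Sepal.Length))
formatted
# [1] "5.843"
```

Formatting with sprintf() up front keeps the rounding under your control instead of leaving it to the implicit conversion.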

This approach also makes other changes easy. For instance, above I replaced the minimum with the first level of Species and the maximum with the last level of Species. That is not necessarily what you'd want, but it shows how simple it is to vary the output by variable type.
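If you would rather keep psych::describe, one possible sketch is to match its rows to the columns of the input by position instead of by name. This assumes that describe() returns one row per column in the original column order, which the code checks through the vars column. The names is_fac, modes, result and center are illustrative and not part of psych:

```r
library(psych)

# Most frequent non-missing value of a vector (same helper as above)
getMode <- function(x) {
  ux <- na.omit(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}

desc <- psych::describe(iris)

# Assumption: row i of desc describes column i of iris; stop if that is not so
stopifnot(identical(as.integer(desc$vars), seq_along(iris)))

# Which columns are factors, and the mode of every column as text
is_fac <- vapply(iris, is.factor, logical(1))
modes  <- vapply(iris, function(x) as.character(getMode(x)), character(1))

result <- data.frame(
  variable = names(iris),
  # mode for factors, formatted mean for everything else
  center = unname(ifelse(is_fac, modes, sprintf("%.3f", desc$mean))),
  sd  = desc$sd,
  min = desc$min,
  max = desc$max,
  n   = desc$n,
  row.names = NULL
)
result
```

Here sd, min and max still come straight from describe(), so for a factor they are computed on its integer codes, and the center column is character for the reason given in the answer.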

