
Replacing mean from psych::describe for mode in dataframe

I like the summary statistics of psych::describe, but I want to replace the mean with the mode, and only for factor variables. How do I get the output of Mode to replace the mean for Species (or any other factor variable)? I use iris for replication even though it has only one factor.

library(dplyr)  # provides %>%, select() and where()

# Most frequent non-missing value of a vector
getMode <- function(df) {
  ux <- na.omit(unique(df))
  ux[which.max(tabulate(match(df, ux)))]
}

Mode <- apply(iris %>% select(where(is.factor)), 2, getMode)

# I only want 5 of psych's descriptive stats plus the mode.
table <- cbind(psych::describe(iris),
               Mode)[, c(3, 4, 8, 9, 2, 14)]
table

How can I get mean and mode to combine depending on the structure of the variable?

  1. Is there a way to combine if_else with where to tell R what to do when the result is FALSE? If I could get the mean as output when the variable is not a factor, I would get a column that combines means and modes.

psych produces a data frame in which the identifying variable names are not selectable (they are row names, not a column), so manually coding or listing the variables in mutate() is impossible. Factors are also the majority of variables in my dataset, so manual coding or a mutate(case_when()) would be REALLY tedious even if it could be done.

PS. I've tried changing my apply() to map functions, but the output is not compatible with cbind() because it lists the other levels for each factor. If you have a better idea for that part of the code, or think that's where I could combine getMode and mean(), I don't mind suggestions.

If you're willing to use a different function to produce the same kind of output, you could use dplyr and tidyr to accomplish this. With this approach you can do just what you want, using ifelse() to distinguish numeric from non-numeric variables. The only thing to note is that if the function produces non-numeric values for factors, the output for the numeric variables in the same column also has to be character. That's why I wrapped the mean() function in sprintf().
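To see why the numeric summaries have to become text, here is a minimal base R sketch (the variable names `mixed` and `formatted` are only illustrative). A vector holds a single type, so combining a number with a factor level coerces the number to character; formatting it first with sprintf() just controls how it is rendered.

```r
# A vector (and therefore a data frame column) holds one type only,
# so mixing a number with a factor level coerces everything to character.
mixed <- c(mean(iris$Sepal.Length), as.character(iris$Species[1]))
class(mixed)
# "character"

# Formatting the number first fixes how many decimals survive the coercion.
formatted <- c(sprintf("%.3f", mean(iris$Sepal.Length)), "setosa")
formatted
# "5.843"  "setosa"
```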

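library(dplyr)  # provides %>% and summarise_all()
library(tidyr)  # provides pivot_longer()

# Most frequent non-missing value of a vector
getMode <- function(df) {
  ux <- na.omit(unique(df))
  ux[which.max(tabulate(match(df, ux)))]
}

iris %>% 
  summarise_all(.funs = list(
    # numeric columns: formatted mean; factors: the mode as text
    mean = function(x) ifelse(is.numeric(x), sprintf("%.3f", mean(x)), as.character(getMode(x))), 
    # factors: sd of the underlying integer codes
    sd = function(x) ifelse(is.numeric(x), sd(x), sd(as.numeric(x))), 
    # factors: first and last level stand in for min and max
    min = function(x) ifelse(is.numeric(x), sprintf("%.3f", min(x)), levels(x)[1]), 
    max = function(x) ifelse(is.numeric(x), sprintf("%.3f", max(x)), levels(x)[length(levels(x))]), 
    n = function(x) sum(!is.na(x))
  )) %>% 
  # columns are named <variable>_<statistic>; reshape to one row per variable
  pivot_longer(everything(),
               names_to = c("set", ".value"),
               names_pattern = "(.+)_(.+)")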
                            
# A tibble: 5 x 6
#   set          mean      sd min    max           n
#   <chr>        <chr>  <dbl> <chr>  <chr>     <int>
# 1 Sepal.Length 5.843  0.828 4.300  7.900       150
# 2 Sepal.Width  3.057  0.436 2.000  4.400       150
# 3 Petal.Length 3.758  1.77  1.000  6.900       150
# 4 Petal.Width  1.199  0.762 0.100  2.500       150
# 5 Species      setosa 0.819 setosa virginica   150

This also lets you make other changes. For instance, above I replaced the minimum with the first level of Species and the maximum with the last level of Species. That is not necessarily what you'd want to do, but it shows how easy it is to change the values of the output based on the type of variable.
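If you would rather keep psych::describe for the other statistics, one possible sketch is to build the combined mean/mode column separately and bind it on. The helper name `mean_or_mode` and the column selection are my own choices, not something from psych. It assumes describe() returns one row per column in the original column order, which is worth checking on your own data. vapply() iterates over the columns of the data frame without converting it to a matrix, so factors stay factors, and as.character() drops the extra levels that got in the way with map.

```r
# Most frequent non-missing value of a vector (same as in the question)
getMode <- function(x) {
  ux <- na.omit(unique(x))
  ux[which.max(tabulate(match(x, ux)))]
}

# Hypothetical helper: formatted mean for numeric columns, mode otherwise.
# Both branches return a length-one character value so they can share a column.
mean_or_mode <- function(x) {
  if (is.numeric(x)) {
    sprintf("%.3f", mean(x, na.rm = TRUE))
  } else {
    as.character(getMode(x))
  }
}

# Named character vector, one entry per column of iris
center <- vapply(iris, mean_or_mode, character(1))
center

# Optional: attach it to a few of psych's statistics, if psych is installed.
if (requireNamespace("psych", quietly = TRUE)) {
  desc <- as.data.frame(psych::describe(iris))
  # Assumption: one row per variable, in the same order as the columns.
  stopifnot(nrow(desc) == length(center))
  combined <- cbind(variable = names(center),
                    desc[, c("n", "sd", "min", "max")],
                    mean_or_mode = unname(center))
  print(combined)
}
```

Putting the variable names in their own column also makes them selectable, which sidesteps the row-name problem mentioned in the question.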

