
Data framing summary statistics in R

I need to create an XLSX file containing the summary statistics (as produced by the summary() function), but I have not found a reliable way to put each value (mean, median, NA's, etc.) in its own row for each of the original variables. Since my dataset has more than 200 variables, I need a systematic approach instead of manually deleting words in my XLSX output.

After some research, I found some partial solutions, such as:

x1 <- as.data.frame(do.call(cbind, lapply(df, summary, is.numeric)))
x2 <- data.frame(unclass(summary(df1)), check.names = FALSE, stringsAsFactors = FALSE)
x3 <- as.data.frame(apply(df,2,summary))
x4 <- data.frame(df1=matrix(df1),row.names=names(df1))

And what I need is something like this:

            y1      y2      y3      y4      y5
Min.      1.00    1.00   23.00   50.00    6.00
1st Qu.  31.75    3.75   30.50   57.25   11.75
Median   43.00    7.00   56.00   76.00   15.00
Mean     51.75    6.10   55.55   72.05   14.35
3rd Qu.  80.25    8.25   73.50   83.75   17.00
Max.     99.00   10.00  100.00   95.00   20.00

If anyone would like to try it, this sample dataset produces the same errors as my large one:

x1 <- rpois(20,5)
x2 <- rexp(20,2)
x3 <- rexp(20,5); x3[1:10] <- NA_real_
x4 <- runif(20,5,10)
x5 <- runif(20,5,12)
df1 <- data.frame(x1,x2,x3,x4,x5)

Thanks in advance!

Consider an example data frame with columns y1, y2, ..., yn to summarise:

library(tidyr)
library(dplyr)

data.frame(y1 = rnorm(100),
           y2 = runif(100) ##, ... yn
           ) %>%
pivot_longer(starts_with('y'),
             names_to = 'variable',
             values_to = 'value'
             ) %>%
    group_by(variable) %>%
    summarise(Min = min(value, na.rm = TRUE),
              Median = median(value, na.rm = TRUE) ##, ad libitum
              ) %>%
    pivot_longer(-variable) %>%
    pivot_wider(names_from = variable)

Generally, the {broom} package offers convenient tidying of summaries into tibbles:

library(broom)
library(dplyr)    # provides the %>% pipe
library(ggplot2)  # provides the mpg dataset
summary(1:10) %>% tidy
lm(displ ~ cyl, data = mpg) %>% tidy

Or, if you want a long table (one row per statistic, as in your example) instead of the wide one-row tibble that tidy returns:

library(broom)
library(tidyr)

summary(1:10) %>%
    tidy %>%
    pivot_longer(everything(),
                 names_to = 'stat',
                 values_to = 'value'
                 )
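
The snippet above tidies a single vector. For a whole data frame, one possible extension is to tidy each column's summary and stack the results. This is only a sketch: it assumes {broom} and {dplyr} are installed, the name summary_by_var is made up for illustration, and the column names of the result (and whether a column for NA counts appears) are whatever tidy produces in your {broom} version.

```r
library(broom)
library(dplyr)

# Data modelled on the question's example (x3 is half missing)
set.seed(43022)
df1 <- data.frame(
  x1 = rpois(20, 5),
  x2 = rexp(20, 2),
  x3 = c(rep(NA_real_, 10), rexp(10, 5))
)

# Tidy each column's summary into a one-row tibble, then stack the rows.
# .id stores the list names (here the column names) in a `variable` column;
# columns missing from some tibbles are filled with NA by bind_rows().
summary_by_var <- bind_rows(
  lapply(df1, function(x) tidy(summary(x))),
  .id = "variable"
)

summary_by_var
```

This gives one row per variable; the pivot_longer / pivot_wider steps shown earlier can then turn the statistics into rows if that layout is preferred.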

Consider casting the summary results to a data.frame, cleaning the columns, then reshaping the output:

summary_raw <- summary(df1)

# SPLIT Freq COLUMN AND SUBSET OUT NA ROWS
summary_long <- within(
  data.frame(summary_raw), {
    Var2 <- trimws(Var2)
    Agg <- trimws(sapply(strsplit(Freq, ':'), "[", 1))
    Num <- as.numeric(sapply(strsplit(Freq, ':'), "[", 2))
    rm(Var1, Freq)
  }
) |> subset(
  !is.na(Agg) & !is.na(Num)
)

# RESHAPE TO WIDE
summary_wide <- reshape(
  summary_long,
  idvar = "Agg",
  v.names = "Num",
  timevar = "Var2",
  direction = "wide"
) |> `row.names<-`(NULL)

colnames(summary_wide) <- gsub(
    "Num\\.", "", names(summary_wide)
)

Input

set.seed(43022)

x1 <- rpois(20,5)
x2 <- rexp(20,2)
x3 <- rexp(20,5); x3[1:10] <- NA_real_
x4 <- runif(20,5,10)
x5 <- runif(20,5,12)
df1 <- data.frame(x1,x2,x3,x4,x5)

Output

> summary_wide
      Agg    x1       x2        x3    x4     x5
1    Min.  1.00 0.003004  0.009565 5.034  6.240
2 1st Qu.  3.00 0.086428  0.020734 6.903  7.323
3  Median  4.00 0.279303  0.035791 7.829  9.492
4    Mean  4.85 0.323793  0.098930 7.780  9.125
5 3rd Qu.  6.25 0.548857  0.067267 8.622 10.685
6    Max. 12.00 0.928066  0.523284 9.908 11.867
7    NA's    NA       NA 10.000000    NA     NA
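
The question asks for an XLSX file, and a table shaped like the one above can be written out directly. The following is a minimal sketch, not part of the original answer: it assumes the {writexl} package is available (falling back to CSV if it is not), and the names summary_table and out_file are hypothetical. It rebuilds a small table so it runs on its own; the same write_xlsx call should apply to summary_wide.

```r
# Rebuild a small version of the question's data so the sketch is self-contained
set.seed(43022)
df1 <- data.frame(x1 = rpois(20, 5), x2 = rexp(20, 2), x4 = runif(20, 5, 10))

# Statistics in rows, variables in columns
# (no NAs here, so every summary has the same six entries)
stats <- sapply(df1, summary)
summary_table <- data.frame(Statistic = rownames(stats), stats,
                            check.names = FALSE)

# Hypothetical output location; replace with a real path
out_file <- file.path(tempdir(), "summary_stats.xlsx")

if (requireNamespace("writexl", quietly = TRUE)) {
  writexl::write_xlsx(summary_table, out_file)
} else {
  # Fallback when writexl is not installed: a CSV that Excel can open
  out_file <- sub("\\.xlsx$", ".csv", out_file)
  write.csv(summary_table, out_file, row.names = FALSE)
}
```

Keeping the statistic names in an ordinary column (rather than as row names) matters here, because the exported file only contains the columns.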

Here is a one-liner.

lapply(df1, summary) |> lapply(`length<-`, 6) |>  do.call(what=rbind) |> t() |> round(2)
#           x1   x2   x3   x4    x5
# Min.    1.00 0.03 0.03 5.23  5.48
# 1st Qu. 2.75 0.26 0.11 6.51  6.85
# Median  4.00 0.56 0.20 8.25  8.29
# Mean    4.55 0.57 0.24 7.94  8.29
# 3rd Qu. 6.00 0.70 0.28 9.43  9.57
# Max.    9.00 1.94 0.82 9.79 11.78

Just use summary in an lapply, set the lengths to 6 to drop the NA count, rbind, transpose, and round. This works for numeric data as in your example.
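
If the NA counts should be kept, or the data frame also contains non-numeric columns, a more explicit variant may help. The following is a sketch in base R under those assumptions; summary_matrix is a hypothetical helper name, not something from the original answers, and a column that is entirely NA would still trigger warnings from min and max.

```r
# Hypothetical helper: one column of statistics per numeric variable,
# with the NA count always present as a seventh row.
summary_matrix <- function(df, digits = 2) {
  # Keep only numeric columns
  numeric_cols <- df[vapply(df, is.numeric, logical(1))]
  stats <- sapply(numeric_cols, function(x) {
    q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    c("Min."    = min(x, na.rm = TRUE),
      "1st Qu." = q[1],
      "Median"  = median(x, na.rm = TRUE),
      "Mean"    = mean(x, na.rm = TRUE),
      "3rd Qu." = q[2],
      "Max."    = max(x, na.rm = TRUE),
      "NA's"    = sum(is.na(x)))
  })
  round(stats, digits)
}

# Small example modelled on the question's data
set.seed(43022)
df1 <- data.frame(
  x1 = rpois(20, 5),
  x3 = c(rep(NA_real_, 10), rexp(10, 5)),
  label = letters[1:20]  # non-numeric column, dropped by the helper
)
res <- summary_matrix(df1)
res
```

Because every column returns the same seven named values, sapply can always simplify the result to a matrix, which avoids the length adjustment used in the one-liner.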

Note: R >= 4.1 is required for the native pipe |>.

