
Create multiple columns in summarize

What is the best way to create multiple columns in summarize(...) (or, alternatively, in do(...) )? This arises if some aggregation function returns more than one value. An example of such a function is quantile(...) .

For example, suppose we have the following data

library(dplyr)

data.frame(x = runif(1000, min = 0, max = 20)) %>%
  mutate(y = rnorm(n(), mean = sin(x))) %>%
  group_by(x.category = round(x)) ->
  Z

We can compute (and plot) quantiles easily:

library(tidyr)   # for gather()
library(ggplot2) # just to display results (not the focus of this question)

Z %>%
  summarize(x = mean(x),
            y25 = quantile(y, probs = .25),
            y50 = quantile(y, probs = .5),
            y75 = quantile(y, probs = .75)) %>%
  gather(Statistic, y, -x, -x.category) %>%
  ggplot(aes(x, y, color = Statistic)) +
  geom_line()

However, the above code has two shortcomings: 1) the quantile(...) code must be duplicated (this would become more tedious if, say, a dozen quantiles were needed), and 2) the column names (y25, y50, y75) might not match the actual quantiles.

These problems can be fixed by leveraging the ability of quantile(...) to compute multiple quantiles and return them in a vector with names, as follows:

Z %>%
  do(as_data_frame(c(x = mean(.$x),
                     as.list(quantile(.$y, probs = c(.25,.5,.75)))))) %>%
  gather(Statistic, y, -x, -x.category) %>%
  ggplot(aes(x, y, color = Statistic)) +
  geom_line()
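To make explicit what the do(...) version relies on, here is a minimal sketch on a toy vector (the values 1 to 5 are only an illustration, not part of the original data). It shows that quantile(...) returns a named vector and that as.list(...) keeps those names, so they can become column names:

```r
# quantile() returns a named numeric vector (names = TRUE is the default)
q <- quantile(c(1, 2, 3, 4, 5), probs = c(.25, .5, .75))
names(q)

# as.list() keeps the names, so each quantile becomes a named list element;
# c() then prepends the extra summary value as another named element
row <- c(x = 3, as.list(q))

# check.names = FALSE keeps the "25%"-style names as column names
df <- as.data.frame(row, check.names = FALSE)
print(df)
```

The column names are taken from the quantile labels themselves, which is why they cannot drift out of sync with the requested probabilities.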

However the above code seems ugly to me; in particular it requires as.list(...) , c(...) , as_data_frame(...) , and do(...) in order to do something pretty simple.

Is there a better way?

One possible approach when dealing with functions that return multiple values is to combine those values into a single string and then separate that string into multiple columns, using the corresponding names.

library(dplyr)
library(tidyr)

data.frame(x = runif(1000, min = 0, max = 20)) %>%
  mutate(y = rnorm(n(), mean = sin(x))) %>%
  group_by(x.category = round(x)) ->
  Z

# specify quantiles
q = c(0.25, 0.5, 0.75)

Z %>%
  summarise(x = mean(x),
            qtls = paste(quantile(y, q), collapse = ",")) %>%   # get quantile values as a string
  separate(qtls, paste0("y_", 100*q), sep = ",", convert = TRUE)   # separate quantile values and give corresponding names to columns

# # A tibble: 21 x 5
#   x.category     x   y_25   y_50    y_75
#        <dbl> <dbl>  <dbl>  <dbl>   <dbl>
# 1          0 0.252 -0.596  0.156  0.977 
# 2          1 0.929 -0.191  0.753  1.15  
# 3          2 2.07   0.222  0.787  1.26  
# 4          3 2.95  -0.488  0.303  1.13  
# 5          4 3.92  -1.38  -0.627 -0.0220
# 6          5 4.94  -1.52  -1.08  -0.489 
# 7          6 6.03  -0.950 -0.432  0.492 
# 8          7 6.97  -0.103  0.602  1.32  
# 9          8 7.94   0.350  1.02   1.88  
# 10         9 9.00  -0.155  0.393  1.02  
# # ... with 11 more rows

Inspired by the answer of @AntoniosK, here is a solution that also places multiple numbers in a single column but, instead of converting them to a string, stores them in a list column:

probs <- c(0.25, 0.5, 0.75)

Z %>%
  summarize(x = mean(x),
            quantile = list(quantile(y, probs)),
            prob = list(probs)) %>%
  unnest(cols = c(quantile, prob))

To convert the result to a wide format one can follow the above with %>% mutate(prob = sprintf('%g%%', 100*prob)) %>% spread(prob, quantile) (as usual).
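Put together, the list-column approach and the wide conversion might look like the sketch below. The fixed seed and the name wide are assumptions added for illustration, and names = FALSE is passed to quantile(...) only because the prob column already carries the labels:

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the sketch is reproducible
Z <- data.frame(x = runif(1000, min = 0, max = 20)) %>%
  mutate(y = rnorm(n(), mean = sin(x))) %>%
  group_by(x.category = round(x))

probs <- c(0.25, 0.5, 0.75)

wide <- Z %>%
  summarize(x = mean(x),
            quantile = list(quantile(y, probs, names = FALSE)),
            prob = list(probs)) %>%
  unnest(cols = c(quantile, prob)) %>%           # one row per group and probability
  mutate(prob = sprintf('%g%%', 100 * prob)) %>% # 0.25 -> "25%"
  spread(prob, quantile)                         # one column per probability

print(wide)
```

The result has one row per x.category and one column per requested quantile, with column names generated from probs.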

One thing I noticed is that unnest(...) ignores the names on the vectors. (In fact, I had hoped that the .id parameter would allow me to take advantage of that, but it looks for names on the list, not on the vectors in the list.) If you really want to use those names, one approach is:

library(tibble)

Z %>%
  summarize(x = mean(x),
            quantile = list(enframe(quantile(y)))) %>%
  unnest(cols = quantile)

which uses tibble::enframe(...) to capture the names into a column of a tibble.
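As a small illustration of what enframe(...) does on its own, here is a sketch using a toy vector (the values 1 to 5 are just an example):

```r
library(tibble)

# enframe() turns a named vector into a two-column tibble:
# `name` holds the vector names, `value` holds the values
e <- enframe(quantile(c(1, 2, 3, 4, 5), probs = c(.25, .5, .75)))
print(e)
```

Because the names end up in an ordinary column, they survive unnesting and can be used for spreading or plotting.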

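You could, for example, use the apply family. Note that sapply(...) iterates over the columns of Z and ignores the grouping, so the result contains one set of statistics per column rather than per group: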

Z %>%
  sapply(function(x){c(quantile(x, probs = (0:10)/10), mean = mean(x))}) %>%
  data.frame()

#                 x         x.1           y x.category
# 0%    0.001726993  0.00274735 -4.04157670      0.000
# 10%   1.495121921  2.11284993 -1.51783484      1.000
# 20%   3.450423732  4.23374999 -0.92207407      3.000
# 30%   5.366798687  6.13729078 -0.55590328      5.000
# 40%   7.424445083  8.00006315 -0.18782436      7.000
# 50%   9.607056717 10.01599003  0.09847098     10.000
# 60%  11.605829581 11.98377222  0.39765998     12.000
# 70%  13.402578154 13.95268995  0.75339699     13.000
# 80%  15.432076896 16.04652040  1.16335283     15.000
# 90%  17.759217854 17.90820096  1.64737747     18.000
# 100% 19.991569165 19.97475065  3.33769925     20.000
# mean  9.544870438 10.02387573  0.08833454      9.551
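The sapply(...) call above works column by column. If per-group statistics are wanted with the apply family, one possible base R sketch is to split y by group first. The helper name stats, the fixed seed, and the name per_group are assumptions made for this illustration:

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible
x <- runif(1000, min = 0, max = 20)
y <- rnorm(length(x), mean = sin(x))
x.category <- round(x)

# hypothetical helper: deciles plus the mean, returned as one named vector
stats <- function(v) c(quantile(v, probs = (0:10) / 10), mean = mean(v))

# split() yields one vector of y values per group; sapply() returns one
# column per group, so t() gives one row per group instead
per_group <- t(sapply(split(y, x.category), stats))
head(per_group)
```

The row names of per_group are the group labels and the column names come from the named vector returned by the helper.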
