What is the best way to create multiple columns in summarize(...) (or, alternatively, in do(...))? This arises when an aggregation function returns more than one value. An example of such a function is quantile(...).
For example, suppose we have the following data:
library(dplyr)
data.frame(x = runif(1000, min = 0, max = 20)) %>%
  mutate(y = rnorm(n(), mean = sin(x))) %>%
  group_by(x.category = round(x)) ->
  Z
We can compute (and plot) quantiles easily:
library(tidyr)   # provides gather()
library(ggplot2) # just to display results (not the focus of this question)
Z %>%
  summarize(x = mean(x),
            y25 = quantile(y, probs = .25),
            y50 = quantile(y, probs = .5),
            y75 = quantile(y, probs = .75)) %>%
  gather(Statistic, y, -x, -x.category) %>%
  ggplot(aes(x, y, color = Statistic)) +
  geom_line()
However, the above code has two shortcomings: 1) the quantile(...) code must be duplicated (this would become more tedious if, say, a dozen quantiles were needed), and 2) the column names (y25, y50, y75) might not match the actual quantiles.
These problems can be fixed by leveraging the ability of quantile(...) to compute multiple quantiles and return them in a named vector, as follows:
Z %>%
  do(as_data_frame(c(x = mean(.$x),
                     as.list(quantile(.$y, probs = c(.25, .5, .75)))))) %>%
  gather(Statistic, y, -x, -x.category) %>%
  ggplot(aes(x, y, color = Statistic)) +
  geom_line()
However, the above code seems ugly to me; in particular, it requires as.list(...), c(...), as_data_frame(...), and do(...) in order to do something pretty simple.
Is there a better way?
One possible approach for functions that return multiple values is to combine those values into a single string and then separate that string into multiple columns with the corresponding names.
library(dplyr)
library(tidyr)
data.frame(x = runif(1000, min = 0, max = 20)) %>%
  mutate(y = rnorm(n(), mean = sin(x))) %>%
  group_by(x.category = round(x)) ->
  Z
# specify quantiles
q <- c(0.25, 0.5, 0.75)
Z %>%
  summarise(x = mean(x),
            # get quantile values as a single comma-separated string
            qtls = paste(quantile(y, q), collapse = ",")) %>%
  # separate quantile values and give corresponding names to columns
  separate(qtls, paste0("y_", 100*q), sep = ",", convert = TRUE)
# # A tibble: 21 x 5
# x.category x y_25 y_50 y_75
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 0 0.252 -0.596 0.156 0.977
# 2 1 0.929 -0.191 0.753 1.15
# 3 2 2.07 0.222 0.787 1.26
# 4 3 2.95 -0.488 0.303 1.13
# 5 4 3.92 -1.38 -0.627 -0.0220
# 6 5 4.94 -1.52 -1.08 -0.489
# 7 6 6.03 -0.950 -0.432 0.492
# 8 7 6.97 -0.103 0.602 1.32
# 9 8 7.94 0.350 1.02 1.88
# 10 9 9.00 -0.155 0.393 1.02
# # ... with 11 more rows
Inspired by the answer of @AntoniosK, here is a solution that also places multiple numbers in a single column, but instead of converting them to a string, stores them in a list column:
probs <- c(0.25, 0.5, 0.75)
Z %>%
  summarize(x = mean(x),
            quantile = list(quantile(y, probs)),
            prob = list(probs)) %>%
  unnest()
To convert the result to a wide format, one can follow the above with %>% mutate(prob = sprintf('%g%%', 100*prob)) %>% spread(prob, quantile) (as usual).
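For reference, here is the list-column approach and the wide-format step written out as one pipeline. This is only a sketch under a few assumptions that are not part of the answer: the seed is fixed so the run is reproducible, the result is stored in a variable called wide (a name chosen here for illustration), and the columns are passed explicitly as unnest(c(quantile, prob)), which assumes tidyr 1.0 or later, where calling unnest() without columns produces a warning.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed, only to make the sketch reproducible

# Same data as in the question
Z <- data.frame(x = runif(1000, min = 0, max = 20)) %>%
  mutate(y = rnorm(n(), mean = sin(x))) %>%
  group_by(x.category = round(x))

probs <- c(0.25, 0.5, 0.75)

wide <- Z %>%
  summarize(x = mean(x),
            quantile = list(quantile(y, probs)),  # one numeric vector per group
            prob = list(probs)) %>%               # matching probabilities
  unnest(c(quantile, prob)) %>%                   # one row per group and quantile
  mutate(prob = sprintf('%g%%', 100 * prob)) %>%  # 0.25 becomes "25%"
  spread(prob, quantile)                          # one column per quantile

names(wide)
```

The result should have one row per x.category, with the quantiles in columns named 25%, 50% and 75% next to x.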
One thing I noticed is that unnest(...) ignores the names on the vectors. (In fact, I had hoped that the .id parameter would allow me to take advantage of that, but it looks for names on the list, not on the vectors in the list.) If you really want to use those names, one approach is:
library(tibble)
Z %>%
  summarize(x = mean(x),
            quantile = list(enframe(quantile(y)))) %>%
  unnest()
which uses tibble::enframe(...) to capture the names into a column of a tibble.
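To see what enframe(...) contributes on its own, it may help to apply it to a small named vector outside the pipeline. The input vector below is made up purely for illustration, and the variable name e is hypothetical.

```r
library(tibble)

# quantile() with default probs returns a named numeric vector
# (names "0%", "25%", "50%", "75%", "100%")
e <- enframe(quantile(c(1, 2, 3, 4, 5)))

# enframe() moves the names into a `name` column and the values into `value`
e
```

Because the names are now ordinary data in the name column, they survive the later unnest(...) step instead of being dropped.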
You could, for example, use the apply family:
Z %>%
  sapply(function(x){c(quantile(x, probs = (0:10)/10), mean = mean(x))}) %>%
  data.frame()
# x x.1 y x.category
# 0% 0.001726993 0.00274735 -4.04157670 0.000
# 10% 1.495121921 2.11284993 -1.51783484 1.000
# 20% 3.450423732 4.23374999 -0.92207407 3.000
# 30% 5.366798687 6.13729078 -0.55590328 5.000
# 40% 7.424445083 8.00006315 -0.18782436 7.000
# 50% 9.607056717 10.01599003 0.09847098 10.000
# 60% 11.605829581 11.98377222 0.39765998 12.000
# 70% 13.402578154 13.95268995 0.75339699 13.000
# 80% 15.432076896 16.04652040 1.16335283 15.000
# 90% 17.759217854 17.90820096 1.64737747 18.000
# 100% 19.991569165 19.97475065 3.33769925 20.000
# mean 9.544870438 10.02387573 0.08833454 9.551