
Create multiple columns in summarize

What is the best way to create multiple columns in summarize(...) (or, alternatively, in do(...))? The need arises when an aggregation function returns more than one value. An example of such a function is quantile(...).

For example, suppose we have the following data:

library(dplyr)

data.frame(x = runif(1000, min = 0, max = 20)) %>%
  mutate(y = rnorm(n(), mean = sin(x))) %>%
  group_by(x.category = round(x)) ->
  Z

We can compute (and plot) quantiles easily:

library(tidyr)   # for gather()
library(ggplot2) # just to display results (not the focus of this question)

Z %>%
  summarize(x = mean(x),
            y25 = quantile(y, probs = .25),
            y50 = quantile(y, probs = .5),
            y75 = quantile(y, probs = .75)) %>%
  gather(Statistic, y, -x, -x.category) %>%
  ggplot(aes(x, y, color = Statistic)) +
  geom_line()

However, the above code has two shortcomings: 1) the quantile(...) code must be duplicated (this would become more tedious if, say, a dozen quantiles were needed), and 2) the column names (y25, y50, y75) might not match the actual quantiles.

These problems can be fixed by leveraging the ability of quantile(...) to compute multiple quantiles and return them in a named vector, as follows:

Z %>%
  do(as_data_frame(c(x = mean(.$x),
                     as.list(quantile(.$y, probs = c(.25,.5,.75)))))) %>%
  gather(Statistic, y, -x, -x.category) %>%
  ggplot(aes(x, y, color = Statistic)) +
  geom_line()

However, the do(...) version above seems ugly to me; in particular, it requires as.list(...), c(...), as_data_frame(...), and do(...) in order to do something pretty simple.

Is there a better way?

One possible approach when dealing with functions that return multiple values is to combine those values into a single string and then separate that string into multiple columns using the corresponding names.

library(dplyr)
library(tidyr)

data.frame(x = runif(1000, min = 0, max = 20)) %>%
  mutate(y = rnorm(n(), mean = sin(x))) %>%
  group_by(x.category = round(x)) ->
  Z

# specify quantiles
q <- c(0.25, 0.5, 0.75)

Z %>%
  summarise(x = mean(x),
            qtls = paste(quantile(y, q), collapse = ",")) %>%   # get quantile values as a string
  separate(qtls, paste0("y_", 100*q), sep = ",", convert = TRUE)   # separate quantile values and give corresponding names to columns

# # A tibble: 21 x 5
#   x.category     x   y_25   y_50    y_75
#        <dbl> <dbl>  <dbl>  <dbl>   <dbl>
# 1          0 0.252 -0.596  0.156  0.977 
# 2          1 0.929 -0.191  0.753  1.15  
# 3          2 2.07   0.222  0.787  1.26  
# 4          3 2.95  -0.488  0.303  1.13  
# 5          4 3.92  -1.38  -0.627 -0.0220
# 6          5 4.94  -1.52  -1.08  -0.489 
# 7          6 6.03  -0.950 -0.432  0.492 
# 8          7 6.97  -0.103  0.602  1.32  
# 9          8 7.94   0.350  1.02   1.88  
# 10         9 9.00  -0.155  0.393  1.02  
# # ... with 11 more rows
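
The string round trip used above can be seen in isolation with base R only. This is a minimal sketch on a hypothetical toy vector (not the Z data); strsplit(...) and setNames(...) stand in for what separate(...) does to each row, and the variable names are made up for illustration.

```r
# Toy input (assumption: chosen only so the quantiles are easy to read).
q <- c(0.25, 0.5, 0.75)
y <- c(1, 2, 3, 4, 5)

# Step 1: collapse the quantiles into one comma-separated string.
s <- paste(quantile(y, q), collapse = ",")

# Step 2: derive the column names from the requested probabilities.
col_names <- paste0("y_", 100 * q)

# Step 3: split the string again and convert the pieces back to numbers.
values <- as.numeric(strsplit(s, ",", fixed = TRUE)[[1]])
result <- setNames(values, col_names)

print(s)
print(result)
```

Because the values pass through text, they are reparsed from their printed form rather than carried over as doubles.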

Inspired by the answer of @AntoniosK, here is a solution that also places multiple numbers in a single column, but instead of converting them to a string, stores them in a list column:

library(dplyr)
library(tidyr)  # for unnest() and spread()

probs <- c(0.25, 0.5, 0.75)

Z %>%
  summarize(x = mean(x),
            quantile = list(quantile(y, probs)),
            prob = list(probs)) %>%
  unnest() 

To convert the result to a wide format, one can follow the above with %>% mutate(prob = sprintf('%g%%', 100*prob)) %>% spread(prob, quantile) (as usual).
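
Put together, the list-column step and the wide-format step might look like the sketch below. It assumes tidyr 1.0 or later, where unnest(...) expects the columns to be named through cols; the set.seed(...) call and the name wide are additions for illustration and are not part of the original answer.

```r
library(dplyr)
library(tidyr)

set.seed(1)  # assumption: fixed seed so the sketch is repeatable

# Same data as in the question.
Z <- data.frame(x = runif(1000, min = 0, max = 20)) %>%
  mutate(y = rnorm(n(), mean = sin(x))) %>%
  group_by(x.category = round(x))

probs <- c(0.25, 0.5, 0.75)

wide <- Z %>%
  summarize(x = mean(x),
            quantile = list(quantile(y, probs)),  # one numeric vector per group
            prob = list(probs)) %>%
  unnest(cols = c(quantile, prob)) %>%            # long format: one row per group and prob
  mutate(prob = sprintf('%g%%', 100 * prob)) %>%  # 0.25 becomes "25%"
  spread(prob, quantile)                          # one column per quantile

names(wide)
```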

One thing I noticed is that unnest(...) ignores the names on the vectors. (In fact, I had hoped that the .id parameter would allow me to take advantage of that, but it looks for names on the list, not on the vectors in the list.) If you really want to use those names, one approach is:

library(tibble)

Z %>%
  summarize(x = mean(x),
            quantile = list(enframe(quantile(y)))) %>%
  unnest()

which uses tibble::enframe(...) to capture the names into a column of a tibble.
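
To see what enframe(...) contributes, it can be applied to a single named vector outside of any pipeline. A small sketch on a hypothetical toy vector (the name long is made up for illustration):

```r
library(tibble)

# Toy input (assumption: not the Z data from the question).
y <- c(1, 2, 3, 4, 5)

# quantile() returns a named vector; enframe() turns the names into a
# "name" column and the numbers into a "value" column.
long <- enframe(quantile(y))
long
```

After unnest(...), those name and value columns appear next to the grouping column, so the quantile labels survive.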

You could, for example, use the apply family:

Z %>%
  sapply(function(x){c(quantile(x, probs = (0:10)/10), mean = mean(x))}) %>%
  data.frame()

#                 x         x.1           y x.category
# 0%    0.001726993  0.00274735 -4.04157670      0.000
# 10%   1.495121921  2.11284993 -1.51783484      1.000
# 20%   3.450423732  4.23374999 -0.92207407      3.000
# 30%   5.366798687  6.13729078 -0.55590328      5.000
# 40%   7.424445083  8.00006315 -0.18782436      7.000
# 50%   9.607056717 10.01599003  0.09847098     10.000
# 60%  11.605829581 11.98377222  0.39765998     12.000
# 70%  13.402578154 13.95268995  0.75339699     13.000
# 80%  15.432076896 16.04652040  1.16335283     15.000
# 90%  17.759217854 17.90820096  1.64737747     18.000
# 100% 19.991569165 19.97475065  3.33769925     20.000
# mean  9.544870438 10.02387573  0.08833454      9.551
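
The shape of that result is easier to read on a small input. The sketch below uses a hypothetical two-column data frame (df and stats are made-up names); as far as I can tell, sapply(...) walks over the columns of the data frame, so each statistic is computed per column over all rows rather than per group.

```r
# Toy data frame (assumption: not the Z data from the question).
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(10, 20, 30, 40, 50))

# For every column, return three quantiles plus the mean as one named vector;
# sapply() binds those vectors into a matrix with one column per input column.
stats <- sapply(df, function(col) {
  c(quantile(col, probs = c(.25, .5, .75)), mean = mean(col))
})

stats
```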
