
R dplyr summarise multiple functions to selected variables

I have a dataset that I want to summarise by mean, but I also want to calculate the max of just one of the variables.

Let me start with an example of what I would like to achieve:

iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>%
  summarise_at("Sepal.Length:Petal.Width",funs(mean))

which gives me the following result:

# A tibble: 3 × 5
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width
      <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
1     setosa          5.8         4.4          1.9         0.5
2 versicolor          7.0         3.4          5.1         1.8
3  virginica          7.9         3.8          6.9         2.5

Is there an easy way to add, for example, max(Petal.Width) to the summarise call?

So far I have tried the following:

iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>%
  summarise_at("Sepal.Length:Petal.Width",funs(mean)) %>%
  mutate(Max.Petal.Width = max(iris$Petal.Width))

But with this approach I lose both the group_by and the filter from the code above, and it gives the wrong results.
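The reason is that `iris$Petal.Width` points at the full, ungrouped and unfiltered data frame rather than at the data flowing through the pipe. A minimal sketch of the difference (the object names `global_max` and `per_group` are only for illustration):

```r
library(dplyr)

# max(iris$Petal.Width) ignores the pipeline: one number for the whole dataset
global_max <- max(iris$Petal.Width)

# a bare column name inside summarise() is evaluated per group, after the filter
per_group <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>%
  summarise(Max.Petal.Width = max(Petal.Width))
```

Here `global_max` is a single value that `mutate` would recycle into every row, while `per_group` holds one maximum per species.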

The only solution I have been able to come up with is the following:

iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>%
  summarise_at("Sepal.Length:Petal.Width",funs(mean,max)) %>%
  select(Species:Petal.Width_mean,Petal.Width_max) %>% 
  rename(Max.Petal.Width = Petal.Width_max) %>%
  rename_(.dots = setNames(names(.), gsub("_.*$","",names(.))))

This is a bit convoluted and involves a lot of typing just to add a column with a different summarisation.

Thank you

Although this is an old question, it remains an interesting problem, and I have two solutions that I believe should be available to whoever finds this page.

Solution one

My own take:

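library(dplyr)
library(purrr)

# pair each set of variables with its own function, then merge the pieces
mapply(summarise_at, 
       .vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"), 
       .funs = lst(mean, max), 
       MoreArgs = list(.tbl = iris %>% group_by(Species) %>% filter(Sepal.Length > 5)),
       SIMPLIFY = FALSE) %>% 
  reduce(merge, by = "Species")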

    #         Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
    #    1     setosa        5.314       3.714        1.509        0.2773           0.5
    #    2 versicolor        5.998       2.804        4.317        1.3468           1.8
    #    3  virginica        6.622       2.984        5.573        2.0327           2.5

Solution two

An elegant solution using the purrr package from the tidyverse itself, inspired by this discussion:

list(.vars = lst(names(iris)[!names(iris)%in%"Species"], "Petal.Width"),
     .funs = lst("mean" = mean, "max" = max)) %>% 
  pmap(~ iris %>% group_by(Species) %>% filter(Sepal.Length > 5) %>% summarise_at(.x, .y)) %>% 
  reduce(inner_join, by = "Species")

# A tibble: 3 x 6
  Species    Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
  <fct>             <dbl>       <dbl>        <dbl>         <dbl>         <dbl>
1 setosa             5.31        3.71         1.51         0.277           0.5
2 versicolor         6.00        2.80         4.32         1.35            1.8
3 virginica          6.62        2.98         5.57         2.03            2.5

Short discussion

The data.frame and the tibble above are the desired result: the last column is the max of Petal.Width and the other columns are the means (by group, after filtering) of the numeric columns.

Both solutions hinge on three realizations:

  1. summarise_at accepts two lists as arguments, one of n variables and one of m functions, and applies all m functions to all n variables, producing m x n columns in a tibble. The solution therefore means forcing this function to loop over "couples": one group of variables paired with the one function we want applied to it, then another group of variables with its own function, and so on.
  2. What does that in R? What applies an operation to corresponding elements of two lists? Functions such as mapply, or the map2 and pmap family and its variations from purrr, dplyr's tidyverse companion. Both accept two lists of l elements and perform a given operation on corresponding elements (matched by position) of the two lists.
  3. Because the result is not a tibble or a data.frame but a list, you simply need to use reduce with inner_join or just merge.
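The three points above can be sketched with map2 and across (dplyr 1.0.0 or later). This is only an illustration under assumptions, not the code from either solution: the names `filtered`, `var_sets`, `fun_set` and `pieces` are made up here, and the `_mean` / `_max` suffixes are a naming choice.

```r
library(dplyr)
library(purrr)

filtered <- iris %>% group_by(Species) %>% filter(Sepal.Length > 5)

# position 1 pairs with "mean", position 2 pairs with "max"
var_sets <- list(
  c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  "Petal.Width"
)
fun_set <- list(mean = mean, max = max)

# one summarise per (variables, function) pair, matched by position
pieces <- map2(var_sets, names(fun_set), function(vars, fname) {
  f <- fun_set[[fname]]
  name_spec <- paste0("{.col}_", fname)
  summarise(filtered, across(all_of(vars), f, .names = name_spec))
})

# the result of map2 is a list of tibbles, so join them back on the group key
result <- reduce(pieces, inner_join, by = "Species")
```

Because each piece carries its function name as a suffix, the join does not need the `.x` / `.y` disambiguation seen in the outputs above.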

Note that the means I obtain are different from those of the OP, but they are also the means I obtain with his reproducible example (maybe we have two different versions of the iris dataset?). The values in the OP's table do coincide with the per-species maxima of the filtered data, so that output may have come from max rather than mean.

If you want to do something more complex like that, you could write your own version of summarize_at. With this version you supply triplets of column names, functions, and naming rules.

Here's a rough start:

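# Relies on unexported dplyr internals (select_colwise_names, as.fun_list,
# colwise_), so it may not run on other dplyr versions.
my_summarise_at<-function (.tbl, ...) 
{
    dots <- list(...)
    # arguments must come in triplets: columns, functions, naming rule
    stopifnot(length(dots)%%3==0)
    vars <- do.call("append", Map(function(.cols, .funs, .name) {
        cols <- select_colwise_names(.tbl, .cols)
        funs <- as.fun_list(.funs, .env = parent.frame())
        val<-colwise_(.tbl, funs, cols)
        # substitute each default name for "%" in the naming rule
        names <- sapply(names(val), function(x) gsub("%", x, .name))
        setNames(val, names)
    }, dots[seq_along(dots)%%3==1], dots[seq_along(dots)%%3==2], dots[seq_along(dots)%%3==0]))
    summarise_(.tbl, .dots = vars)
}
# evaluate inside dplyr's namespace so the internal helpers are found
environment(my_summarise_at)<-getNamespace("dplyr")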

And you can call it with:

iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>%
  my_summarise_at("Sepal.Length:Petal.Width", mean, "%_mean", 
      "Petal.Width", max, "%_max")

For the names we just replace the "%" with the default name. The idea is just to dynamically build the summarize_ expression. The summarize_at function is really just a convenience wrapper around that basic function.
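The naming rule on its own can be tried in base R. A small sketch, where `default_names` and `rule` are hypothetical stand-ins for the column names and the third element of a triplet:

```r
# hypothetical inputs: default column names and one naming rule
default_names <- c("Sepal.Length", "Petal.Width")
rule <- "%_mean"

# replace "%" in the rule with each default name, as my_summarise_at does
new_names <- unname(sapply(default_names, function(x) gsub("%", x, rule)))
```

With these inputs, `new_names` holds "Sepal.Length_mean" and "Petal.Width_mean".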

I was looking for something similar and tried the following. It works well and is much easier to read than the suggested solutions.

iris %>% 
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>% 
  summarise(MeanSepalLength = mean(Sepal.Length), 
            MeanSepalWidth = mean(Sepal.Width),
            MeanPetalLength = mean(Petal.Length),
            MeanPetalWidth = mean(Petal.Width), 
            MaxPetalWidth = max(Petal.Width))

# A tibble: 3 x 6
  Species    MeanSepalLength MeanSepalWidth MeanPetalLength MeanPetalWidth MaxPetalWidth
  <fct>                <dbl>          <dbl>           <dbl>          <dbl>         <dbl>
1 setosa                5.31           3.71            1.51          0.277           0.5
2 versicolor            6.00           2.80            4.32          1.35            1.8
3 virginica             6.62           2.98            5.57          2.03            2.5

In the summarise() call, define each new column name and pass the column you want to summarise to your function of choice.

If you are trying to do everything with dplyr (which might be easier to remember), then you can leverage the across function, available since dplyr 1.0.0.

iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>% 
  summarize(across(Sepal.Length:Petal.Width, mean)) %>% 
  cbind(iris %>% 
          group_by(Species) %>% 
          filter(Sepal.Length > 5) %>% 
          summarize(across(Petal.Width, max)) %>% 
          select(-Species)
  )

The only difficulty is combining two calculations on the same column Petal.Width of a grouped variable: you have to do the grouping (and the filtering) again, but you can nest it inside the cbind. This returns the correct result:

     Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Width
1     setosa     5.313636    3.713636     1.509091   0.2772727         0.5
2 versicolor     5.997872    2.804255     4.317021   1.3468085         1.8
3  virginica     6.622449    2.983673     5.573469   2.0326531         2.5

If the task required only one calculation on the column Petal.Width rather than two, then this could be elegantly written as:

iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>% 
  summarize(
    across(Sepal.Length:Petal.Length, mean),
    across(Petal.Width, max)
  )
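When both the mean and the max of Petal.Width are needed, one possible variation (a sketch, not part of the answer above) is to give the across() output distinct names through its `.names` argument. The original Petal.Width column is then not overwritten by its mean before max is computed. The `_mean` and `_max` suffixes are an arbitrary choice here.

```r
library(dplyr)

result <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>%
  summarise(
    # means of all four columns, renamed so Petal.Width itself stays untouched
    across(Sepal.Length:Petal.Width, mean, .names = "{.col}_mean"),
    # still refers to the raw, filtered Petal.Width values
    Petal.Width_max = max(Petal.Width)
  )
```

This keeps everything in one grouped pipeline, at the cost of suffixed column names.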
