R dplyr summarise multiple functions to selected variables
I have a dataset that I want to summarise by mean, but I also want to calculate the max of just one of the variables.
Let me start with an example of what I would like to achieve:
    iris %>%
      group_by(Species) %>%
      filter(Sepal.Length > 5) %>%
      summarise_at("Sepal.Length:Petal.Width", funs(mean))
which gives me the following result:
    # A tibble: 3 × 5
         Species Sepal.Length Sepal.Width Petal.Length Petal.Width
          <fctr>        <dbl>       <dbl>        <dbl>       <dbl>
    1     setosa          5.8         4.4          1.9         0.5
    2 versicolor          7.0         3.4          5.1         1.8
    3  virginica          7.9         3.8          6.9         2.5
Is there an easy way to add, for example, `max(Petal.Width)` to the summarise?
So far I have tried the following:
    iris %>%
      group_by(Species) %>%
      filter(Sepal.Length > 5) %>%
      summarise_at("Sepal.Length:Petal.Width", funs(mean)) %>%
      mutate(Max.Petal.Width = max(iris$Petal.Width))
But with this approach I lose both the `group_by` and the `filter` from the code above, and it gives the wrong results.
The only solution I have been able to achieve is the following:
    iris %>%
      group_by(Species) %>%
      filter(Sepal.Length > 5) %>%
      summarise_at("Sepal.Length:Petal.Width", funs(mean, max)) %>%
      select(Species:Petal.Width_mean, Petal.Width_max) %>%
      rename(Max.Petal.Width = Petal.Width_max) %>%
      rename_(.dots = setNames(names(.), gsub("_.*$", "", names(.))))
This is a bit convoluted and involves a lot of typing just to add a column with a different summarisation.
Thank you
Although this is an old question, it remains an interesting problem for which I have two solutions that I believe should be available to whoever finds this page.
Solution one
My own take:
    library(dplyr)
    library(purrr)  # for reduce()

    # One summarise_at() call per (variables, function) pair, then merge the pieces.
    mapply(summarise_at,
           .vars = lst(names(iris)[!names(iris) %in% "Species"], "Petal.Width"),
           .funs = lst(mean, max),
           MoreArgs = list(.tbl = iris %>% group_by(Species) %>% filter(Sepal.Length > 5)),
           SIMPLIFY = FALSE) %>%  # always keep the results as a list of tibbles
      reduce(merge, by = "Species")
    #      Species Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
    # 1     setosa        5.314       3.714        1.509        0.2773           0.5
    # 2 versicolor        5.998       2.804        4.317        1.3468           1.8
    # 3  virginica        6.622       2.984        5.573        2.0327           2.5
Solution two
An elegant solution using the package `purrr` from the tidyverse itself, inspired by this discussion:
    list(.vars = lst(names(iris)[!names(iris) %in% "Species"], "Petal.Width"),
         .funs = lst("mean" = mean, "max" = max)) %>%
      pmap(~ iris %>% group_by(Species) %>% filter(Sepal.Length > 5) %>% summarise_at(.x, .y)) %>%
      reduce(inner_join, by = "Species")
    # A tibble: 3 x 6
      Species    Sepal.Length Sepal.Width Petal.Length Petal.Width.x Petal.Width.y
      <fct>             <dbl>       <dbl>        <dbl>         <dbl>         <dbl>
    1 setosa             5.31        3.71         1.51         0.277           0.5
    2 versicolor         6.00        2.80         4.32         1.35            1.8
    3 virginica          6.62        2.98         5.57         2.03            2.5
Short discussion
The data.frame and the tibble are the desired result, the last column being the `max` of `Petal.Width` and the other ones the means (by group and filter) of all other columns.
Both solutions hinge on three realizations:
1. `summarise_at` accepts as arguments two lists, one of n variables and one of m functions, and applies all m functions to all n variables, therefore producing m × n vectors in a tibble. The solution might thus imply forcing this function to loop in some way across "couples", each formed by a group of variables and the one specific function we want applied to them, then another group of variables and their own function, and so on!
2. Functions such as `mapply`, or the family of functions `map2`, `pmap` and variations thereof from `dplyr`'s tidyverse fellow `purrr`, do just that. Both accept two lists of l elements and perform a given operation on corresponding elements (matched by position) of the two lists.
3. Because the product is not a tibble or a data.frame, but a list, you just need to use `reduce` with `inner_join` or just `merge`.

Note that the means I obtain are different from those of the OP, but they are the means I obtain with his reproducible example as well (maybe we have two different versions of the `iris` dataset?).
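The three realizations above can be sketched with `map2`, which takes exactly two lists matched by position. This is only an illustration, assuming dplyr 1.0.0 or later (for `across`) and purrr; the names `filtered`, `var_sets`, `fun_set` and `pieces` are made up for the example, and the `suffix` argument of `inner_join` is used to label the two `Petal.Width` columns.

```r
library(dplyr)
library(purrr)

# Grouped and filtered data, shared by every summary (hypothetical name)
filtered <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5)

# "Couples": the i-th set of variables goes with the i-th function
var_sets <- list(
  c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  "Petal.Width"
)
fun_set <- list(mean, max)

# One summarised tibble per couple
pieces <- map2(var_sets, fun_set, function(vars, f) {
  summarise(filtered, across(all_of(vars), f))
})

# The product is a list, so join its elements back together on the grouping column
result <- reduce(pieces, inner_join, by = "Species",
                 suffix = c("_mean", "_max"))
result
```

Only `Petal.Width` appears in both pieces, so it is the only column that receives a suffix.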
If you wanted to do something more complex like that, you could write your own version of `summarize_at`. With this version you supply triplets of column names, functions, and naming rules. Here's a rough start:
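    # Rough sketch: it relies on unexported dplyr internals (select_colwise_names(),
    # as.fun_list(), colwise_()) as they existed when this was written, so it may
    # not run on newer dplyr releases.
    my_summarise_at <- function(.tbl, ...) {
      dots <- list(...)
      # arguments must come in (columns, function, name pattern) triplets
      stopifnot(length(dots) %% 3 == 0)
      vars <- do.call("append", Map(function(.cols, .funs, .name) {
        cols <- select_colwise_names(.tbl, .cols)
        funs <- as.fun_list(.funs, .env = parent.frame())
        val <- colwise_(.tbl, funs, cols)
        # substitute the default name for "%" in the pattern
        names <- sapply(names(val), function(x) gsub("%", x, .name))
        setNames(val, names)
      }, dots[seq_along(dots) %% 3 == 1], dots[seq_along(dots) %% 3 == 2], dots[seq_along(dots) %% 3 == 0]))
      summarise_(.tbl, .dots = vars)
    }
    # give the function access to dplyr's internal helpers
    environment(my_summarise_at) <- getNamespace("dplyr")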
And you can call it with

    iris %>%
      group_by(Species) %>%
      filter(Sepal.Length > 5) %>%
      my_summarise_at("Sepal.Length:Petal.Width", mean, "%_mean",
                      "Petal.Width", max, "%_max")

For the names we just replace the "%" with the default name. The idea is just to dynamically build the `summarize_` expression. The `summarize_at` function is really just a convenience wrapper around that basic function.
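The same idea of dynamically building the summarise expression can be sketched without dplyr internals, using quoted expressions spliced into `summarise()` with `!!!`. This is an illustrative alternative rather than the code above; it assumes the rlang package is installed, and the names `mean_cols`, `mean_exprs` and `max_exprs` are made up for the example.

```r
library(dplyr)
library(rlang)

# Columns to average (hypothetical name)
mean_cols <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Build one quoted call per column, e.g. mean(Sepal.Length)
mean_exprs <- lapply(mean_cols, function(col) expr(mean(!!sym(col))))
# Naming rule: "<column>_mean"
names(mean_exprs) <- paste0(mean_cols, "_mean")

# A second group with its own function and naming rule
max_exprs <- list(Petal.Width_max = expr(max(Petal.Width)))

# Splice all the expressions into a single summarise() call
result <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>%
  summarise(!!!c(mean_exprs, max_exprs))
result
```

Because every output name differs from its input column, no summary overwrites a column that a later expression still needs.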
I was looking for something similar and tried the following. It works well and is much easier to read than the suggested solutions.
    iris %>%
      group_by(Species) %>%
      filter(Sepal.Length > 5) %>%
      summarise(MeanSepalLength = mean(Sepal.Length),
                MeanSepalWidth = mean(Sepal.Width),
                MeanPetalLength = mean(Petal.Length),
                MeanPetalWidth = mean(Petal.Width),
                MaxPetalWidth = max(Petal.Width))
    # A tibble: 3 x 6
      Species    MeanSepalLength MeanSepalWidth MeanPetalLength MeanPetalWidth MaxPetalWidth
      <fct>                <dbl>          <dbl>           <dbl>          <dbl>         <dbl>
    1 setosa                5.31           3.71            1.51          0.277           0.5
    2 versicolor            6.00           2.80            4.32          1.35            1.8
    3 virginica             6.62           2.98            5.57          2.03            2.5
In the `summarise()` part, define your column name and pass the column you want to summarise to your function of choice.
If you are trying to do everything with dplyr (which might be easier to remember), then you can leverage the new `across` function, which is available from dplyr 1.0.0.
    iris %>%
      group_by(Species) %>%
      filter(Sepal.Length > 5) %>%
      summarize(across(Sepal.Length:Petal.Width, mean)) %>%
      cbind(iris %>%
              group_by(Species) %>%
              filter(Sepal.Length > 5) %>%  # same filter as above, so the max matches the means
              summarize(across(Petal.Width, max)) %>%
              select(-Species)
      )

It shows that the only difficulty is to combine two calculations on the same column `Petal.Width` on a grouped variable: you have to do the grouping and filtering again, but you can nest them into the `cbind`. This returns the correct result:

         Species Sepal.Length Sepal.Width Petal.Length Petal.Width Petal.Width
    1     setosa     5.313636    3.713636     1.509091   0.2772727         0.5
    2 versicolor     5.997872    2.804255     4.317021   1.3468085         1.8
    3  virginica     6.622449    2.983673     5.573469   2.0326531         2.5
If the task did not call for two calculations on the same column `Petal.Width`, but only one, then it could be elegantly written as:
    iris %>%
      group_by(Species) %>%
      filter(Sepal.Length > 5) %>%
      summarize(
        across(Sepal.Length:Petal.Length, mean),
        across(Petal.Width, max)
      )
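The nested `cbind` pipeline may be avoidable. `summarise()` evaluates its arguments in order, and later expressions see columns that earlier ones have already replaced, so once `across(..., mean)` has overwritten `Petal.Width`, a following `max(Petal.Width)` would operate on the mean. One possible workaround, sketched here assuming a recent dplyr 1.0.x release that supports the `{.col}` placeholder in `.names`, is to give the means new names so the original column stays available. The `_mean` suffix and the name `Petal.Width_max` are arbitrary choices for this example.

```r
library(dplyr)

result <- iris %>%
  group_by(Species) %>%
  filter(Sepal.Length > 5) %>%
  summarise(
    # Means go into new columns, leaving the original Petal.Width untouched
    across(Sepal.Length:Petal.Width, mean, .names = "{.col}_mean"),
    # Petal.Width still refers to the per-group data here
    Petal.Width_max = max(Petal.Width)
  )
result
```

Both summaries then come from a single grouped and filtered pipeline.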