Using dplyr summarize with different operations for multiple columns

Question

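I know there are already plenty of related questions, but none of them answers my particular need.

I want to use dplyr "summarize" on a table with 50 columns, and I need to apply different summary functions to different columns.

"summarize_all" and "summarize_at" both seem to have the disadvantage that they cannot apply different functions to different subgroups of variables.

As an example, assume the iris dataset had 50 columns, so we do not want to address columns by name. I want the sum over the first two columns, the mean over the third, and the first value for all remaining columns (after a group_by(Species)). How can I do this?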

Answer 1

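Fortunately, there is a much simpler way available now. With the new dplyr 1.0.0 coming out soon, you can use the across function for this purpose.

All you need to type is: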

library(dplyr)

iris %>% 
  group_by(Species) %>% 
  summarize(
    # I want the sum over the first two columns, 
    across(c(1,2), sum),
    #  the mean over the third 
    across(3, mean),
    # the first value for all remaining columns (after a group_by(Species))
    across(-c(1:3), first)
  )

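Great, isn't it? At first I thought across was not necessary, since the scoped variants worked just fine, but this use case is exactly where the across function is very beneficial.

You can get the latest development version of dplyr with devtools::install_github("tidyverse/dplyr").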
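To illustrate how the positional selections scale to the 50-column case from the question, here is a minimal sketch on a made-up wide table. The data frame `wide`, its grouping column `grp`, and the random values are assumptions for illustration only; they are not part of the original answer. The grouping column is placed last so that positions 1 to 50 refer to the value columns.

```r
library(dplyr)

# Hypothetical wide table: 50 numeric columns (V1..V50) plus one grouping column
set.seed(1)
wide <- as.data.frame(matrix(runif(6 * 50), nrow = 6))
wide$grp <- rep(c("a", "b"), each = 3)

res <- wide %>%
  group_by(grp) %>%
  summarize(
    across(1:2, sum),       # sum over the first two columns
    across(3, mean),        # mean over the third column
    across(-(1:3), first)   # first value of every remaining column
  )

dim(res)
```

The result has one row per group and keeps all 50 value columns, without any column being addressed by name.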

Answer 2

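As other people have mentioned, this is normally done by calling summarize_each / summarize_at / summarize_if for every group of columns that you want to apply a summarizing function to. As far as I know, you would have to create a custom function that performs the summarization for each subset. For example, you can set the column names in such a way that you can use the select helpers (e.g. contains()) to filter just the columns you want to apply the function to. If not, you can set the specific column numbers that you want to summarize.

For the example you mentioned, you could try the following: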

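library(dplyr)

# Summarize three column subsets of a grouped tibble, each with its own
# function, then bind the three results side by side.
summarizer <- function(tb, colsone, colstwo, colsthree, 
                       funsone, funstwo, funsthree, group_name) {
  bind_cols(
    # The grouping column is kept from the first summary only
    summarize_all(select(tb, {{ colsone }}), .funs = funsone),
    summarize_all(select(tb, {{ colstwo }}), .funs = funstwo) %>% 
      ungroup() %>% select(-matches(group_name)),
    summarize_all(select(tb, {{ colsthree }}), .funs = funsthree) %>% 
      ungroup() %>% select(-matches(group_name))
  )
}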

#With colnames
iris %>% as_tibble() %>% 
  group_by(Species) %>% 
  summarizer(colsone = contains("Sepal"), 
         colstwo = matches("Petal.Length"), 
         colsthree = c(-contains("Sepal"), -matches("Petal.Length")),
         funsone = "sum", 
         funstwo = "mean",
         funsthree = "first",
         group_name = "Species")

#With indexes
iris %>% as_tibble() %>% 
 group_by(Species) %>% 
 summarizer(colsone = 1:2, 
         colstwo = 3, 
         colsthree = 4,
         funsone = "sum", 
         funstwo = "mean",
         funsthree = "first",
         group_name = "Species")
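For comparison, the same name-based selections can also be passed directly to across(), assuming dplyr 1.0.0 or later is installed. This is only a sketch of an alternative, not part of the answer above; the result name `by_name` is made up for illustration.

```r
library(dplyr)

by_name <- iris %>%
  group_by(Species) %>%
  summarize(
    # sum of every column whose name contains "Sepal"
    across(contains("Sepal"), sum),
    # mean of the Petal.Length column
    across(matches("Petal.Length"), mean),
    # first value of everything not selected above
    across(c(-contains("Sepal"), -matches("Petal.Length")), first)
  )

by_name
```

Here across() never selects the grouping column, so there is no need to drop Species from the second and third pieces by hand.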

Answer 3

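You could summarise the data with each function separately and then join the results afterwards if needed.

So, for the iris example, something like this: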

library(dplyr)

sums <- iris %>% group_by(Species) %>% summarise_at(1:2, sum)
means <- iris %>% group_by(Species) %>% summarise_at(3, mean)
firsts <- iris %>% group_by(Species) %>% summarise_at(4, first)
# Join on the grouping column explicitly
full_join(sums, means, by = "Species") %>% full_join(firsts, by = "Species")

If you need more than a few summary functions, I would try to think of a different approach.
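One possible way to keep the summarise-then-join idea manageable with more functions is to describe each piece in a list and loop over it. This is a sketch under assumptions: the `specs` list, its `cols` and `fun` fields, and the names `pieces` and `combined` are hypothetical and not from the answer.

```r
library(dplyr)

# Hypothetical spec: each entry pairs column positions with a summary function
specs <- list(
  list(cols = 1:2, fun = sum),
  list(cols = 3,   fun = mean),
  list(cols = 4,   fun = first)
)

# One grouped summary per spec entry
pieces <- lapply(specs, function(s) {
  iris %>% group_by(Species) %>% summarise_at(s$cols, s$fun)
})

# Join all pieces on the grouping column
combined <- Reduce(function(x, y) full_join(x, y, by = "Species"), pieces)
combined
```

Adding another summary then only means adding one more entry to the list.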

Answer 4

Try this:

library(plyr)
library(dplyr)

dataframe <- data.frame(var = c(1,1,1,2,2,2),var2 = c(10,9,8,7,6,5),var3=c(2,3,4,5,6,7),var4=c(5,5,3,2,4,2))
dataframe

#  var var2 var3 var4
#1   1   10    2    5
#2   1    9    3    5
#3   1    8    4    3
#4   2    7    5    2
#5   2    6    6    4
#6   2    5    7    2

funnames<-c(sum,mean,first)
colnums<-c(2,3,4)
ddply(.data = dataframe,.variables = "var",
    function(x,funcs,inds){
        mapply(function(func,ind){
            func(x[,ind])
        },funcs,inds)
    },funnames,colnums)

#  var V1 V2 V3
#1   1 27  3  5
#2   2 18  6  2

Answer 5

See this - an upcoming feature

5 answers

Answer 1: score 6, 2020-05-20 09:03:13
Answer 2: score 5, accepted, 2018-02-28 14:53:07
Answer 3: score 1
Answer 4: score 0, 2018-02-23 09:40:03
Answer 5: score 0, 2020-05-08 00:08:29
