Filtering inside dplyr::summarise

Question

Inside dplyr::summarise, how can I apply a filter based on a column other than the one I am summarising?

Example:

t = data.frame(
  x = c(1,1,1,1,2,2,2,2,3,3, 3, 3),
  y = c(1,2,3,4,5,6,7,8,9,10,11,12),
  z = c(1,2,1,2,1,2,1,2,1,2, 1, 2)
)

t %>%
  dplyr::group_by(x) %>%
  dplyr::summarise(
    mall = mean(y), # this should include all rows in each group
    ma = mean(y), # this should only include rows where z == 1
    mb = mean(y)  # this should only include rows where z == 2
  )

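So the question is how to apply a summary function to one column while filtering on another column, all inside summarise.

One idea is double grouping, that is, calling group_by on both x and z. But I don't want every summary column to be based on the double grouping; some of them (like mall in the example above) should be based on the single grouping only.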

Answer 1

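A quick option is to use ifelse to keep only the rows you need, set the rest to missing, and ignore the missing values with the na.rm = TRUE argument, as in the example below.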

    t %>%
    dplyr::group_by(x) %>%
    dplyr::summarise(
        mall = mean(y), # this should include all rows in each group
        ma = mean(ifelse(z == 1, y, NA), na.rm = TRUE), # this should only include rows where z == 1
        mb = mean(ifelse(z == 2, y, NA), na.rm = TRUE)  # this should only include rows where z == 2
    )

# A tibble: 3 x 4
      x  mall    ma    mb
  <dbl> <dbl> <dbl> <dbl>
1     1   2.5     2     3
2     2   6.5     6     7
3     3  10.5    10    11

Answer 2

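While @Colin H's answer is certainly the way to go for this specific example, a more flexible approach may be to work within the subsets produced by the first grouping operation. This can be done with dplyr::group_split followed by purrr::map_dfr, but dplyr::group_modify does it in one step.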
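For comparison, here is a minimal sketch of the two-step group_split + map_dfr route mentioned above. It assumes the example data from the question; the names res_split and d are made up for illustration. Because each subset is handled outside a grouped context, x has to be carried through the inner group_by by hand.

```r
library(dplyr)
library(purrr)

# Example data from the question
t <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  z = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
)

res_split <- t %>%
  group_split(x) %>%              # list with one tibble per value of x
  map_dfr(function(d) {           # d is one subset; results are row-bound
    d %>%
      mutate(mall = mean(y)) %>%  # mean over all rows of this subset
      group_by(x, z, mall) %>%    # keep x and mall as columns in the result
      summarise(m = mean(y), .groups = "drop")
  })

res_split
```

Note that purrr::map_dfr is marked as superseded in recent purrr releases, which is one more reason to prefer group_modify here.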

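Note the relevant sentence in the dplyr::group_modify documentation:

Use group_modify() when summarise() is too limited, in terms of what you need to do and return for each group.

So here is a solution for the example given above: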

t = data.frame(
  x = c(1,1,1,1,2,2,2,2,3,3, 3, 3),
  y = c(1,2,3,4,5,6,7,8,9,10,11,12),
  z = c(1,2,1,2,1,2,1,2,1,2, 1, 2)
)

t %>%
  dplyr::group_by(x) %>%
  dplyr::group_modify(function(x, ...) {
    x %>% dplyr::mutate(
      mall = mean(y)
    ) %>%
      dplyr::group_by(z, mall) %>%
      dplyr::summarise(
        m = mean(y),
        .groups = "drop"
      )
  }) %>%
  dplyr::ungroup()

# A tibble: 6 x 4
      x     z  mall     m
  <dbl> <dbl> <dbl> <dbl>
1     1     1   2.5     2
2     1     2   2.5     3
3     2     1   6.5     6
4     2     2   6.5     7
5     3     1  10.5    10
6     3     2  10.5    11

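After grouping by x, group_modify applies a function to each subset tibble. This function takes two arguments:

The data subset for the group, exposed as .x.

The key, a tibble with exactly one row and one column per grouping variable, exposed as .y.

(.x and .y are the names used when the function is given as a formula. The function above names its first argument x and absorbs the key with ... instead.)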

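In our function, we first use mutate to cover the requested mall case. No further grouping is needed here because the wrapping group_modify already takes care of it. We then apply another group_by + summarise to cover the distinct values of z. Note that this solution does not depend on how many cases of z we want to consider. The two cases in this example are easy to handle by hand, but that may change when there are more.

If you need a wide output format with a separate column for each case of z, you can follow up with tidyr::pivot_wider.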
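As a rough sketch of that last step, the long result can be reshaped as follows. The names long and wide and the prefix "m_z" are illustrative choices, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Example data from the question
t <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  z = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
)

# Long result: one row per (x, z) combination
long <- t %>%
  group_by(x) %>%
  group_modify(function(.x, .y) {
    .x %>%
      mutate(mall = mean(y)) %>%
      group_by(z, mall) %>%
      summarise(m = mean(y), .groups = "drop")
  }) %>%
  ungroup()

# Wide result: one row per x, one column per value of z
wide <- long %>%
  pivot_wider(
    names_from = z,        # values of z become column names
    values_from = m,       # cell contents come from m
    names_prefix = "m_z"   # illustrative prefix, gives m_z1 and m_z2
  )

wide
```

The columns x and mall are left untouched and serve as the row identifiers, so the wide table has one row per group of x.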

Answer 3

Another option, perhaps a bit more concise, is subsetting:

t %>% 
  group_by(x) %>%
  summarise(mall = mean(y), 
            ma = mean(y[z == 1]), 
            mb = mean(y[z == 2]))
# A tibble: 3 x 4
      x  mall    ma    mb
* <dbl> <dbl> <dbl> <dbl>
1     1   2.5     2     3
2     2   6.5     6     7
3     3  10.5    10    11
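
One edge case to keep in mind with the subsetting approach: if a group has no rows matching the condition, the subset is empty and mean() returns NaN. The sketch below uses a small made-up data frame d (not the data from the question) in which group x == 2 has no rows with z == 1. The n_a column is an illustrative way to see how many rows went into each mean.

```r
library(dplyr)

# Hypothetical data: group x == 2 has no rows where z == 1
d <- data.frame(
  x = c(1, 1, 2, 2),
  y = c(1, 2, 3, 4),
  z = c(1, 2, 2, 2)
)

res_edge <- d %>%
  group_by(x) %>%
  summarise(
    ma  = mean(y[z == 1]),  # mean of an empty vector is NaN
    n_a = sum(z == 1)       # number of rows that matched the condition
  )

res_edge
```

The ifelse + na.rm = TRUE variant behaves the same way in this situation, since removing the NA values also leaves an empty vector.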

Answer 4

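Here is another general way (like group_modify) to perform custom filtering on the group data while summarising. It uses one of dplyr's context-dependent expressions, cur_data(), which makes the current group's data available inside dplyr verbs such as mutate/summarise: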

t %>%
  dplyr::group_by(x) %>%
  dplyr::summarize(
    mall = mean(y),
    ma   = mean(cur_data() %>% as.data.frame() %>% filter(z == 1) %>% pull(y)),
    mb   = mean(cur_data() %>% as.data.frame() %>% filter(z == 2) %>% pull(y))
  )

The benefit of cur_data() is that you can perform arbitrarily complex filtering or processing before returning the final summary. For details, see: https://dplyr.tidyverse.org/reference/context.html
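
Note that cur_data() was deprecated in dplyr 1.1.0 in favour of pick(). A minimal sketch of the same idea with pick(), assuming dplyr 1.1.0 or later is installed (res_pick is an illustrative name):

```r
library(dplyr)  # pick() requires dplyr >= 1.1.0

# Example data from the question
t <- data.frame(
  x = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12),
  z = c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)
)

res_pick <- t %>%
  group_by(x) %>%
  summarise(
    mall = mean(y),
    # pick(y, z) returns the current group's y and z columns as a tibble
    ma = mean(pick(y, z) %>% filter(z == 1) %>% pull(y)),
    mb = mean(pick(y, z) %>% filter(z == 2) %>% pull(y))
  )

res_pick
```

Selecting only the columns that are needed keeps the per-group tibble small; pick(everything()) would be the closer equivalent of cur_data().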

Answer 1: score 2, accepted, 2021-02-26 16:01:29

Answer 2: score 1, 2021-02-26 17:49:21

Answer 3: score 1, 2021-02-26 19:02:37

Answer 4: score 0, 2022-01-14 23:52:57