Mutating columns with an inline group_by in dplyr

Question

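I am trying to create new columns that are grouped by different columns, but I am not sure whether the way I am doing it is the best use of group_by. Is there a way to apply group_by inline, within a single call?

I know this can be done with the data.table package, where the syntax has the form DT[i, j, by].

But this is a small piece of a larger code base that uses the tidyverse and works well, so I would rather not deviate from it.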

## Creating Sample Data Frame
library(tidyverse)

state <- rep(c("OH", "IL", "IN", "PA", "KY"), 10)
county <- sample(LETTERS[1:5], 50, replace = TRUE) %>% str_c(state, sep = "-")
# sample.int() expects a single n, so draw from the ranges with sample() instead
customers <- sample(50:100, 50)
sales <- sample(500:5000, 50)

df <- data.frame(state, county, customers, sales)

## workflow

df2 <- df %>%
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales)) %>% 
  ungroup %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>% 
  ungroup %>% 
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  group_by(state) %>% 
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup

I would like my code to look something like this:

df3 <- df %>%
  mutate(customerInState = sum(customers, by = state),
         saleInState = sum(sales, by = state),
         customerInCounty = sum(customers, by = county),
         saleInCounty = sum(sales, by = county),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState,
         minSale = min(salePerCountyPercent, by = state))

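It runs without errors, but I know the output is not right.

I know it may be possible to juggle the mutate calls to get what I need with fewer group_by calls. But the question is whether there is a way to do an inline group by in dplyr.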

Answer 1

You can create a wrapper that does what you want. This particular solution works if you have a single grouping variable. Good luck!

library(tidyverse)

mutate_by <- function(.data, group, ...) {

  group_by(.data, !!enquo(group)) %>%
    mutate(...) %>%
    ungroup

}

df1 <- df %>%
  mutate_by(state, 
            customerInState = sum(customers),
            saleInState = sum(sales)) %>%
  mutate_by(county,
            customerInCounty = sum(customers),
            saleInCounty = sum(sales)) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(state,
            minSale = min(salePerCountyPercent))

identical(df2, df1)
[1] TRUE

Edit: or, more concisely and closer to your code:

df %>%
  mutate_by(customerInState = sum(customers),
            saleInState = sum(sales), group = state) %>%
  mutate_by(customerInCounty = sum(customers),
            saleInCounty = sum(sales), group = county) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(minSale = min(salePerCountyPercent), group = state)
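
The wrapper above takes one grouping variable. A minimal sketch of a variant that accepts several, assuming dplyr 1.0.0 or later for across(); the name mutate_by_cols and the toy data are made up for illustration and are not part of the original answer:

```r
library(dplyr)

# Hypothetical variant of mutate_by(): `groups` is a character vector
# of column names to group by (assumes dplyr >= 1.0.0 for across()).
mutate_by_cols <- function(.data, groups, ...) {
  .data %>%
    group_by(across(all_of(groups))) %>%
    mutate(...) %>%
    ungroup()
}

# Made-up toy data, small enough to check by hand
toy <- data.frame(
  state  = c("OH", "OH", "OH", "IL"),
  county = c("A-OH", "A-OH", "B-OH", "A-IL"),
  sales  = c(10, 20, 30, 40)
)

res <- mutate_by_cols(toy, c("state", "county"), saleInCounty = sum(sales))
res$saleInCounty
```

Passing the groups as strings keeps the helper simple; the tidy-eval form used in the original wrapper would work just as well for a single column.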

Answer 2

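Ah, you mean the syntax style. No, that is not how the tidyverse works, I am afraid. If you want tidy code, you are better off using pipes. However: (i) once you have grouped something, it stays grouped until you group again by a different column; (ii) if you group again, there is no need to ungroup first. So we can shorten your code: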

df3 <- df %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers), 
         saleInCounty = sum(sales)) %>% 
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup

Three mutate calls and two group_by calls.

The column order is now different, but we can easily test that the data are the same:

identical((df3 %>% select(colnames(df2))), (df2)) # TRUE
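
Points (i) and (ii) can be checked directly with group_vars(). A small sketch on made-up toy data (the data frame and variable names here are illustrative only):

```r
library(dplyr)

# Made-up toy data
toy <- data.frame(
  state  = c("OH", "OH", "IL"),
  county = c("A-OH", "B-OH", "A-IL"),
  sales  = c(10, 20, 30)
)

by_county <- toy %>% group_by(county)
# Grouping again replaces the old grouping; no ungroup() in between
regrouped <- by_county %>% group_by(state)

group_vars(by_county)
group_vars(regrouped)
```

By default group_by() overrides any existing grouping, which is why the intermediate ungroup calls in the question can be dropped.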

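(iii) I do not know the administrative structure of the US, but I assume counties are nested within states, right? In that case, how about using summarise? Do you need to keep all the individual sales, or would it be enough to produce statistics per county and/or per state?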
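
On the syntax point at the start of this answer: dplyr 1.1.0 and later offer a per-operation grouping argument, `.by`, on mutate() and related verbs. A rough sketch of the question's workflow written that way, assuming that version or newer is installed and using made-up toy data:

```r
library(dplyr)  # assumes dplyr >= 1.1.0, which introduced the `.by` argument

# Made-up toy data, small enough to check by hand
toy <- data.frame(
  state     = c("OH", "OH", "OH", "IL"),
  county    = c("A-OH", "A-OH", "B-OH", "A-IL"),
  customers = c(1, 2, 3, 4),
  sales     = c(10, 20, 30, 40)
)

res <- toy %>%
  mutate(customerInState = sum(customers),
         saleInState = sum(sales), .by = state) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales), .by = county) %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  mutate(minSale = min(salePerCountyPercent), .by = state)
```

Each `.by` applies only to its own mutate() call, so the result comes back ungrouped and no ungroup step is needed. On older dplyr versions this argument is not available and the group_by approach above still applies.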

Answer 3

You can do it in two steps: create two summary data sets, then left_join them.

library(dplyr)

df2 <- df %>%
  group_by(state) %>% 
  summarise(customerInState = sum(customers),
            saleInState = sum(sales))

df3 <- df %>%
  group_by(state, county) %>%
  summarise(customerInCounty = sum(customers),
            saleInCounty = sum(sales))

df2 <- left_join(df2, df3, by = "state") %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  group_by(state) %>% 
  mutate(minSale = min(salePerCountyPercent))

Finally, clean up.

rm(df3)
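
Note that the joined result above has one row per county rather than one row per original record. If the row-level layout of the question's workflow is needed, the summaries can be joined back onto the original data. A sketch on made-up toy data (state_totals, county_totals and row_level are illustrative names):

```r
library(dplyr)

# Made-up toy data
toy <- data.frame(
  state  = c("OH", "OH", "OH", "IL"),
  county = c("A-OH", "A-OH", "B-OH", "A-IL"),
  sales  = c(10, 20, 30, 40)
)

state_totals <- toy %>%
  group_by(state) %>%
  summarise(saleInState = sum(sales))

county_totals <- toy %>%
  group_by(county) %>%
  summarise(saleInCounty = sum(sales))

# Joining the summaries back onto the original rows keeps one row per sale
row_level <- toy %>%
  left_join(state_totals, by = "state") %>%
  left_join(county_totals, by = "county") %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState)
```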

3 solutions

Solution 1
4 2019-07-16 15:11:51

Solution 2
3 2019-07-16 15:03:48

Solution 3
3 2019-07-16 15:06:24
