Inline group by in dplyr to mutate columns

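I am trying to create new columns that are grouped by different columns, but I am not sure whether the way I am using group_by is the best approach. Is there a way to do the group_by inline?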

I know this can be done with the data.table package, where the syntax has the form DT[i, j, by].

But since this is a small piece of a larger codebase that uses the tidyverse and works well, I would rather not deviate from it.
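For reference, the data.table form mentioned above looks roughly like the sketch below. The toy table and the column name customerInState are made up for illustration; the only point is that `by` sits inside the same call as the assignment.

```r
library(data.table)

# Toy data, invented for this illustration only
DT <- data.table(state = c("OH", "OH", "IL"),
                 customers = c(10, 20, 30))

# DT[i, j, by]: i is left empty, j adds a column by reference,
# and by = state computes the sum within each state
DT[, customerInState := sum(customers), by = state]
```

Each row now carries the total for its own state, without a separate grouping step.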

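## Creating Sample Data Frame
library(tidyverse)  # provides %>%, str_c() and the dplyr verbs used below

set.seed(1)  # any seed works; fixed only so the sample data are reproducible
state <- rep(c("OH", "IL", "IN", "PA", "KY"), 10)
county <- sample(LETTERS[1:5], 50, replace = TRUE) %>% str_c(state, sep = "-")
# sample.int() expects a single number n, so draw from the ranges with sample()
customers <- sample(50:100, 50, replace = TRUE)
sales <- sample(500:5000, 50)

df <- data.frame(state, county, customers, sales)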

## workflow

df2 <- df %>%
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales)) %>% 
  ungroup %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>% 
  ungroup %>% 
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  group_by(state) %>% 
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup

I would like my code to look like this:

df3 <- df %>%
  mutate(customerInState = sum(customers, by = state),
         saleInState = sum(sales, by = state),
         customerInCounty = sum(customers, by = county),
         saleInCounty = sum(sales, by = county),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState,
         minSale = min(salePerCountyPercent, by = state))

It runs without errors, but I know the output is not right.
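A likely reason the output is wrong: base R's sum() has no by argument, so anything passed as by = ... is simply treated as one more set of values to add up. The small sketch below uses made-up numeric vectors to show the effect; with a character grouping column, the same call would be expected to fail instead of returning a number.

```r
# Made-up values for illustration only
customers <- c(10, 20, 30)
state_id  <- c(1, 2, 3)

# Plain total of the column
total_plain <- sum(customers)

# `by` is not a grouping argument here: it falls into `...`
# and its values are added to the total
total_with_by <- sum(customers, by = state_id)

# With a character vector the extra argument cannot be summed at all
char_attempt <- tryCatch(sum(customers, by = c("OH", "IL", "IN")),
                         error = function(e) e)
```

Here total_plain is 60 while total_with_by is 66, which is neither a per-group sum nor the plain total.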

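I know it is possible to juggle things inside mutate to get what I need with fewer group_by calls. But the question is whether there is a way to do an inline group by in dplyr.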

You can create a wrapper to do what you want. This particular solution works if you have a single grouping variable. Good luck!

library(tidyverse)

mutate_by <- function(.data, group, ...) {

  group_by(.data, !!enquo(group)) %>%
    mutate(...) %>%
    ungroup

}

df1 <- df %>%
  mutate_by(state, 
            customerInState = sum(customers),
            saleInState = sum(sales)) %>%
  mutate_by(county,
            customerInCounty = sum(customers),
            saleInCounty = sum(sales)) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(state,
            minSale = min(salePerCountyPercent))

identical(df2, df1)
[1] TRUE
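If more than one grouping variable is needed, one possible extension is to pass the grouping columns through across(). This is only a sketch: the helper name mutate_by_cols and the toy data are hypothetical and not part of the answer above, and it assumes a dplyr version that supports across() (1.0.0 or later).

```r
library(dplyr)

# Hypothetical helper, not from the original answer:
# `groups` may be one column or several wrapped in c()
mutate_by_cols <- function(.data, groups, ...) {
  .data %>%
    group_by(across({{ groups }})) %>%
    mutate(...) %>%
    ungroup()
}

# Toy data, invented for this illustration
toy <- data.frame(
  state  = c("OH", "OH", "OH", "IL"),
  county = c("A", "A", "B", "A"),
  sales  = c(1, 2, 4, 8)
)

# Sum sales within each state/county combination
out <- toy %>%
  mutate_by_cols(c(state, county), saleInStateCounty = sum(sales))
```

A single column still works the same way, for example mutate_by_cols(state, saleInState = sum(sales)).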

Edit: or, more concisely and closer to your code:

df %>%
  mutate_by(customerInState = sum(customers),
            saleInState = sum(sales), group = state) %>%
  mutate_by(customerInCounty = sum(customers),
            saleInCounty = sum(sales), group = county) %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate_by(minSale = min(salePerCountyPercent), group = state)
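Depending on the installed dplyr version, a wrapper may not be needed: dplyr 1.1.0 and later accept a .by argument in mutate() that groups for that one call only and returns an ungrouped result. The sketch below assumes such a version and uses a small made-up data frame rather than the df from the question. Note that .by still applies to the whole mutate() call, not to individual columns.

```r
library(dplyr)

# Toy data, invented for this illustration (requires dplyr >= 1.1.0)
toy <- data.frame(
  state     = c("OH", "OH", "IL", "IL"),
  county    = c("A-OH", "B-OH", "A-IL", "A-IL"),
  customers = c(10, 20, 30, 40),
  sales     = c(100, 300, 200, 400)
)

res <- toy %>%
  # per-state totals; grouping applies to this call only
  mutate(customerInState = sum(customers),
         saleInState = sum(sales), .by = state) %>%
  # per-county totals
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales), .by = county) %>%
  # ungrouped ratios
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  # smallest county share within each state
  mutate(minSale = min(salePerCountyPercent), .by = state)
```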

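Ah, you mean the syntax style. No, that is not how the tidyverse works, I'm afraid. If you want tidy code, you are best off using pipes. However: (i) once you have grouped something, it stays grouped until you group again by a different column. (ii) There is no need to ungroup if you are going to group again. So we can shorten your code: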

df3 <- df %>% 
  group_by(county) %>% 
  mutate(customerInCounty = sum(customers), 
         saleInCounty = sum(sales)) %>% 
  group_by(state) %>% 
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>% 
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup

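That is two group_by calls and three mutate calls (the last mutate could also be folded into the one before it, since mutate can refer to columns created earlier in the same call).

Now, the order of the columns is different, but we can easily test that the data are the same: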

identical((df3 %>% select(colnames(df2))), (df2)) # TRUE
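Points (i) and (ii) can be checked directly with group_vars(), which reports the current grouping columns. The data frame below is a made-up toy example, not the df from the question.

```r
library(dplyr)

# Toy data, invented for this illustration
toy <- data.frame(
  state  = c("OH", "OH", "IL"),
  county = c("A-OH", "B-OH", "A-IL"),
  sales  = c(1, 2, 3)
)

# (i) the grouping survives a mutate()
by_county <- toy %>%
  group_by(county) %>%
  mutate(saleInCounty = sum(sales))

# (ii) a second group_by() replaces the earlier grouping by default,
# so no ungroup() is needed in between
by_state <- by_county %>%
  group_by(state)

vars_after_mutate  <- group_vars(by_county)
vars_after_regroup <- group_vars(by_state)
```

vars_after_mutate is "county" and vars_after_regroup is "state".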

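(iii) I don't know the administrative structure of the US, but I presume that counties are nested within states, right? Then how about using summarise? Do you need to keep all the individual sales, or is it enough to produce statistics by county and/or by state?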

You can do it in two steps, creating two datasets and then joining them with left_join:

library(dplyr)

# One row per state
df2 <- df %>%
  group_by(state) %>%
  summarise(customerInState = sum(customers),
            saleInState = sum(sales))

# One row per state/county pair; drop the leftover grouping
df3 <- df %>%
  group_by(state, county) %>%
  summarise(customerInCounty = sum(customers),
            saleInCounty = sum(sales),
            .groups = "drop")

# Join on state explicitly, then compute shares and the per-state minimum
df2 <- left_join(df2, df3, by = "state") %>%
  mutate(salePerCountyPercent  = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState) %>%
  group_by(state) %>%
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup()

Finally, clean up.

rm(df3)
