In line group by in dplyr to mutate columns
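I am trying to create new columns that are grouped by different columns, but I am not sure whether the way I am using group_by is the best approach. Is there a way to do the group_by inline?
I know this can be done with the data.table package, where the syntax has the form DT[i, j, by].
But since this is a small part of a larger codebase that uses the tidyverse and works well, I would rather not deviate from it.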
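## Creating Sample Data Frame
library(tidyverse)  # provides %>%, str_c() and the dplyr verbs used below
state <- rep(c("OH", "IL", "IN", "PA", "KY"), 10)
county <- sample(LETTERS[1:5], 50, replace = TRUE) %>% str_c(state, sep = "-")
# sample.int() expects a single number n, not a range; sample() draws from the range itself
customers <- sample(50:100, 50)
sales <- sample(500:5000, 50)
df <- bind_cols(data.frame(state, county, customers, sales))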
## workflow
df2 <- df %>%
  group_by(state) %>%
  mutate(customerInState = sum(customers),
         saleInState = sum(sales)) %>%
  ungroup() %>%
  group_by(county) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>%
  ungroup() %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  group_by(state) %>%
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup()
I would like my code to look something like this:
df3 <- df %>%
  mutate(customerInState = sum(customers, by = state),
         saleInState = sum(sales, by = state),
         customerInCounty = sum(customers, by = county),
         saleInCounty = sum(sales, by = county),
         salePerCountyPercent = saleInCounty/saleInState,
         customerPerCountyPercent = customerInCounty/customerInState,
         minSale = min(salePerCountyPercent, by = state))
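It runs without error, but I know the output is not right.
I know it may be possible to juggle the mutate calls so that fewer group_by calls are needed. But the question is whether there is a way to do an inline group by in dplyr.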
You can create a wrapper to do what you want. This particular solution works if you have a single grouping variable. Good luck!
library(tidyverse)
# Group by one column, apply the mutate expressions, then drop the grouping
mutate_by <- function(.data, group, ...) {
  group_by(.data, !!enquo(group)) %>%
    mutate(...) %>%
    ungroup()
}
df1 <- df %>%
  mutate_by(state,
            customerInState = sum(customers),
            saleInState = sum(sales)) %>%
  mutate_by(county,
            customerInCounty = sum(customers),
            saleInCounty = sum(sales)) %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  mutate_by(state,
            minSale = min(salePerCountyPercent))
identical(df2, df1)
[1] TRUE
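The wrapper above takes a single grouping column. As a rough sketch of how it might be extended to several columns, the variant below wraps the group argument in across(); this assumes dplyr 1.0.0 or later, and the name mutate_by_multi and the toy data are hypothetical, not part of the original answer.

```r
library(dplyr)

# Hypothetical variant of mutate_by that accepts one or more grouping columns,
# passed as a tidyselect expression such as c(state, county).
mutate_by_multi <- function(.data, group, ...) {
  .data %>%
    group_by(across({{ group }})) %>%
    mutate(...) %>%
    ungroup()
}

# Small made-up data set, only for illustration
toy <- data.frame(
  state  = c("OH", "OH", "OH", "IL"),
  county = c("A-OH", "A-OH", "B-OH", "A-IL"),
  sales  = c(10, 20, 70, 40)
)

res_multi <- toy %>%
  mutate_by_multi(c(state, county), saleInCounty = sum(sales))
```

A single column still works the same way, for example mutate_by_multi(state, saleInState = sum(sales)).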
Edit: or, more concisely and closer to your code:
df %>%
  mutate_by(customerInState = sum(customers),
            saleInState = sum(sales), group = state) %>%
  mutate_by(customerInCounty = sum(customers),
            saleInCounty = sum(sales), group = county) %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  mutate_by(minSale = min(salePerCountyPercent), group = state)
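For readers on a recent dplyr, a per-call grouping argument may make the wrapper unnecessary. The sketch below assumes dplyr 1.1.0 or later, where mutate() accepts a .by argument that groups for that one call only; the toy data is made up for illustration.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, which added the .by argument

# Small made-up data set, only for illustration
toy <- data.frame(
  state  = c("OH", "OH", "OH", "IL"),
  county = c("A-OH", "A-OH", "B-OH", "A-IL"),
  sales  = c(10, 20, 70, 40)
)

res_by <- toy %>%
  mutate(saleInState = sum(sales), .by = state) %>%      # grouped only for this call
  mutate(saleInCounty = sum(sales), .by = county) %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState) %>%
  mutate(minSale = min(salePerCountyPercent), .by = state)
```

Because .by applies to a single call, the result is never left grouped and no ungroup() is needed.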
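Ah, you mean the syntax style. No, that is not how the tidyverse works, I'm afraid. If you want tidy code, you had better use pipes. However: (i) once you have grouped something, it stays grouped until you group again by different columns; (ii) if you group again, there is no need to ungroup first. So we can shorten your code: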
df3 <- df %>%
  group_by(county) %>%
  mutate(customerInCounty = sum(customers),
         saleInCounty = sum(sales)) %>%
  group_by(state) %>%
  mutate(customerInState = sum(customers),
         saleInState = sum(sales),
         salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  mutate(minSale = min(salePerCountyPercent)) %>%
  ungroup()
That is two mutate steps under two group_by calls.
Now, the column order is different, but we can easily test whether the data are the same:
identical((df3 %>% select(colnames(df2))), (df2)) # TRUE
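Points (i) and (ii) can be checked directly with group_vars(). The following is a minimal sketch on made-up data; the .add = TRUE variant assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Small made-up data set, only for illustration
toy <- data.frame(
  state  = c("OH", "OH", "OH", "IL"),
  county = c("A-OH", "A-OH", "B-OH", "A-IL"),
  sales  = c(10, 20, 70, 40)
)

# A second group_by() replaces the existing grouping by default
regrouped <- toy %>%
  group_by(county) %>%
  group_by(state)
group_vars(regrouped)

# With .add = TRUE the new column is appended to the existing grouping instead
added <- toy %>%
  group_by(county) %>%
  group_by(state, .add = TRUE)
group_vars(added)
```

The first call reports only state, which is why the intermediate ungroup calls in the original workflow are redundant.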
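(iii) I don't know the administrative structure of the US, but I assume counties are nested within states, right? Then how about using summarise? Do you need to keep all the individual sales, or would it be enough to produce county-level and/or state-level statistics?
You can do it in two steps, creating two datasets and then combining them with left_join.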
library(dplyr)
# State-level totals: one row per state
df2 <- df %>%
  group_by(state) %>%
  summarise(customerInState = sum(customers),
            saleInState = sum(sales))
# County-level totals: one row per state/county pair
df3 <- df %>%
  group_by(state, county) %>%
  summarise(customerInCounty = sum(customers),
            saleInCounty = sum(sales))
# Join on the shared key, stated explicitly instead of relying on the default
df2 <- left_join(df2, df3, by = "state") %>%
  mutate(salePerCountyPercent = saleInCounty / saleInState,
         customerPerCountyPercent = customerInCounty / customerInState) %>%
  group_by(state) %>%
  mutate(minSale = min(salePerCountyPercent))
Finally, clean up.
rm(df3)
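On the question in (iii) of whether the individual rows are needed: the practical difference is that mutate keeps every input row and repeats the group total, while summarise collapses each group to one row. A small sketch on made-up data:

```r
library(dplyr)

# Small made-up data set, only for illustration
toy <- data.frame(
  state  = c("OH", "OH", "OH", "IL"),
  county = c("A-OH", "A-OH", "B-OH", "A-IL"),
  sales  = c(10, 20, 70, 40)
)

# mutate: same number of rows as the input, total repeated within each state
with_mutate <- toy %>%
  group_by(state) %>%
  mutate(saleInState = sum(sales)) %>%
  ungroup()

# summarise: one row per state
with_summarise <- toy %>%
  group_by(state) %>%
  summarise(saleInState = sum(sales))

nrow(with_mutate)
nrow(with_summarise)
```

If the per-sale rows are still needed downstream, the mutate form is the one to keep; otherwise the summarise form gives a smaller table.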