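Lump factor levels based on another column
This example shows production output measured for different factories, where the first column is the factory and the last column is the amount produced.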
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production)
df
factory production
1 A 15
2 A 2
3 B 1
4 B 1
5 B 2
6 B 1
7 B 2
8 C 20
9 D 5
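Now I would like to lump the factories into fewer levels, based on their total production in this dataset.
With the regular forcats::fct_lump, I can lump them by the number of rows in which each one appears, e.g. to end up with 3 levels: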
library(tidyverse)
df %>% mutate(factory=fct_lump(factory,2))
factory production
1 A 15
2 A 2
3 B 1
4 B 1
5 B 2
6 B 1
7 B 2
8 Other 20
9 Other 5
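But I want to lump them based on sum(production), keeping the top n=2 factories (by total production) and lumping the remaining factories into one level. Desired result: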
1 A 15
2 A 2
3 Other 1
4 Other 1
5 Other 2
6 Other 1
7 Other 2
8 C 20
9 Other 5
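Any suggestions?
Thanks!
The key here is to decide on a specific rule for grouping factories by their total production. Note that this rule depends on the actual values in your (real) dataset.
Option 1
This example groups together the factories whose total production is 15 or less. If you want a different grouping, change the threshold (for example, use 18 instead of 15).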
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)
library(dplyr)
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(sum(production) > 15, factory, "Other")) %>%
  ungroup()
# # A tibble: 9 x 3
# factory production factory_new
# <chr> <dbl> <chr>
# 1 A 15 A
# 2 A 2 A
# 3 B 1 Other
# 4 B 1 Other
# 5 B 2 Other
# 6 B 1 Other
# 7 B 2 Other
# 8 C 20 C
# 9 D 5 Other
Note that I am creating factory_new without removing the (original) factory column.
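To pick a sensible threshold it can help to look at the per-factory totals first. The following is only a minimal sketch on the example data; the names totals and total are my own and not part of the original answer.

```r
library(dplyr)

factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

# Total production per factory, largest first.
# `totals` and `total` are illustrative names.
totals <- df %>%
  group_by(factory) %>%
  summarise(total = sum(production)) %>%
  arrange(desc(total))

totals
```

On this data the totals are C = 20, A = 17, B = 7 and D = 5, so with the `sum(production) > threshold` test any threshold from 7 up to (but not including) 17 keeps A and C, while a threshold of 18 would keep only C.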
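Option 2
In this example you rank the factories by their total production, then choose how many of the top factories to keep as they are and group the rest together.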
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)
library(dplyr)
# get ranked factories based on sum production
df %>%
  group_by(factory) %>%
  summarise(SumProd = sum(production)) %>%
  arrange(desc(SumProd)) %>%
  pull(factory) -> vec_top_factories
# input how many top factories you want to keep
# rest will be grouped together
n = 2
# apply the grouping based on n provided
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(factory %in% vec_top_factories[1:n], factory, "Other")) %>%
  ungroup()
# # A tibble: 9 x 3
# factory production factory_new
# <chr> <dbl> <chr>
# 1 A 15 A
# 2 A 2 A
# 3 B 1 Other
# 4 B 1 Other
# 5 B 2 Other
# 6 B 1 Other
# 7 B 2 Other
# 8 C 20 C
# 9 D 5 Other
We can also do this in base R, using ave to build the logical condition:
df$factory_new <- "Other"
i1 <- with(df, ave(production, factory, FUN = sum) > 15)
df$factory_new[i1] <- df$factory[i1]
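The ave version above uses a fixed threshold. If you would rather keep the top n factories by total, something along these lines might work in base R as well. This is a rough sketch only; totals, top and n are illustrative names, and it assumes factory is a character column.

```r
factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

n <- 2  # how many top factories to keep (illustrative)

# Total production per factory, as a named vector
totals <- tapply(df$production, df$factory, sum)

# Names of the n factories with the largest totals
top <- head(names(totals)[order(totals, decreasing = TRUE)], n)

# Keep the top factories, lump everything else into "Other"
df$factory_new <- ifelse(df$factory %in% top, df$factory, "Other")
df
```

Ties in the totals are broken by the order in which order() returns them, so you may want an explicit tie rule on real data.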
Simply specify the weight argument w:
df %>%
  mutate(factory = fct_lump_n(factory, 2, w = production))
factory production
1 A 15
2 A 2
3 Other 1
4 Other 1
5 Other 2
6 Other 1
7 Other 2
8 C 20
9 Other 5
Note: use forcats::fct_lump_n, because the generic fct_lump is no longer recommended.
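As a quick sanity check of the weighted lumping, you could sum production per resulting level. This is just a sketch on the example data; lumped, check and total are names I made up for illustration.

```r
library(dplyr)
library(forcats)

factory <- c("A", "A", "B", "B", "B", "B", "B", "C", "D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

# Keep the 2 levels with the largest summed weight, lump the rest
lumped <- df %>%
  mutate(factory = fct_lump_n(factory, n = 2, w = production))

# Weighted count per level: total production for A, C and Other
check <- lumped %>%
  count(factory, wt = production, name = "total")

check
```

Here A and C should keep their own totals (17 and 20) and B and D should end up together under Other (7 + 5 = 12). The label of the lumped level can be changed with the other_level argument of fct_lump_n.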