Lumping factor levels based on another column

Question

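This example shows production output measured for several factories. The first column identifies the factory and the last column gives the amount produced.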

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production)
df
  factory production
1       A         15
2       A          2
3       B          1
4       B          1
5       B          2
6       B          1
7       B          2
8       C         20
9       D          5

Now I want to lump the factories into fewer levels, based on their total production in this dataset.

With the regular forcats::fct_lump, I can lump them by the number of rows in which each factory appears, for example to get 3 levels:

library(tidyverse)
df %>% mutate(factory = fct_lump(factory, 2))
  factory production
1       A         15
2       A          2
3       B          1
4       B          1
5       B          2
6       B          1
7       B          2
8   Other         20
9   Other          5

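But I want to lump them based on sum(production), keeping the top n=2 factories (by total production) and lumping the remaining factories into one level. Desired result: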

1       A         15
2       A          2
3   Other          1
4   Other          1
5   Other          2
6   Other          1
7   Other          2
8       C         20
9   Other          5

Any suggestions?

Thanks!

Answer 1

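The key here is to choose a specific rule for grouping factories together based on their total production. Note that the right rule depends on the actual values in your (real) dataset.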
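
Before choosing a rule, it can help to look at the per-factory totals that the rule will operate on. The following is a minimal base R sketch, assuming the data from the question; the name totals is introduced here for illustration only.

```r
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

# total production per factory, largest first
totals <- sort(tapply(df$production, df$factory, sum), decreasing = TRUE)
totals
# C = 20, A = 17, B = 7, D = 5
```

With these totals, a threshold of 15 and a "top 2" rule both keep A and C.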

Option 1

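This example groups together the factories whose total production is 15 or less. If you want a different grouping, change the threshold (for example, use 18 instead of 15).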

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

library(dplyr)

df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(sum(production) > 15, factory, "Other")) %>%
  ungroup()

# # A tibble: 9 x 3
#   factory production factory_new
#   <chr>        <dbl> <chr>      
# 1 A               15 A          
# 2 A                2 A          
# 3 B                1 Other      
# 4 B                1 Other      
# 5 B                2 Other      
# 6 B                1 Other      
# 7 B                2 Other      
# 8 C               20 C          
# 9 D                5 Other

Note that I am creating factory_new rather than overwriting the (original) factory column.

Option 2

In this example you rank the factories by their total production, keep a chosen number of top factories as they are, and group the rest together.

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2,20,5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

library(dplyr)

# get ranked factories based on sum production
df %>%
  group_by(factory) %>%
  summarise(SumProd = sum(production)) %>%
  arrange(desc(SumProd)) %>%
  pull(factory) -> vec_top_factories

# input how many top factories you want to keep
# rest will be grouped together
n <- 2

# apply the grouping based on n provided
df %>%
  group_by(factory) %>%
  mutate(factory_new = ifelse(factory %in% vec_top_factories[1:n], factory, "Other")) %>%
  ungroup()

# # A tibble: 9 x 3
#   factory production factory_new
#   <chr>        <dbl> <chr>      
# 1 A               15 A          
# 2 A                2 A          
# 3 B                1 Other      
# 4 B                1 Other      
# 5 B                2 Other      
# 6 B                1 Other      
# 7 B                2 Other      
# 8 C               20 C          
# 9 D                5 Other
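
If the ranking approach is needed more than once, the two steps can be wrapped in a small function. The sketch below is one possible way to do that, not the only one: lump_top_n is a hypothetical helper name (it is not part of dplyr), it assumes the columns are called factory and production as in the question, and it uses slice_max, which requires dplyr 1.0.0 or later.

```r
library(dplyr)

# hypothetical helper: keep the n factories with the largest total production
lump_top_n <- function(data, n) {
  top <- data %>%
    group_by(factory) %>%
    summarise(SumProd = sum(production)) %>%
    slice_max(SumProd, n = n, with_ties = FALSE) %>%  # at most n rows, even with ties
    pull(factory) %>%
    as.character()

  # relabel every factory outside the top n as "Other"
  data %>%
    mutate(factory_new = ifelse(factory %in% top, as.character(factory), "Other"))
}

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

res <- lump_top_n(df, n = 2)
res$factory_new
```

Setting with_ties = FALSE is a design choice: with tied totals it still returns at most n factories, at the cost of breaking the tie by row order.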

Answer 2

We can also do this in base R, using ave to build a logical condition:

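# start with every row labelled "Other"
df$factory_new <- "Other"
# TRUE for rows whose factory has a total production above 15
i1 <- with(df, ave(production, factory, FUN = sum) > 15)
# restore the original name for those rows (as.character() also covers a factor column)
df$factory_new[i1] <- as.character(df$factory[i1])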
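
The snippet above uses a fixed threshold, while the question asks for the top n factories. A base R variant of the same idea might look like the sketch below; it assumes the data from the question, and the names totals and top are introduced here for illustration.

```r
factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)
df <- data.frame(factory, production, stringsAsFactors = FALSE)

n <- 2  # number of top factories to keep (value taken from the question)

# total production per factory, then the names of the n largest
totals <- tapply(df$production, df$factory, sum)
top <- head(names(sort(totals, decreasing = TRUE)), n)

# keep the name for top factories, relabel the rest
df$factory_new <- ifelse(df$factory %in% top, as.character(df$factory), "Other")
df
```

Using head() rather than indexing with 1:n avoids NA entries when n is larger than the number of factories.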

Answer 3

Just specify the weight argument w:

> df %>% 
+   mutate(factory = fct_lump_n(factory, 2, w = production))
  factory production
1       A         15
2       A          2
3   Other          1
4   Other          1
5   Other          2
6   Other          1
7   Other          2
8       C         20
9   Other          5

Note: forcats::fct_lump_n is used here because the generic fct_lump is no longer recommended.
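
To check that the weights, rather than the row counts, drive the lumping, the per-level totals after lumping can be inspected. This is a minimal sketch, assuming forcats is installed and passing the plain vectors from the question directly; lumped is a name introduced here for illustration.

```r
library(forcats)

factory <- c("A","A","B","B","B","B","B","C","D")
production <- c(15, 2, 1, 1, 2, 1, 2, 20, 5)

# keep the 2 levels with the largest summed weight, lump the rest
lumped <- fct_lump_n(factory, n = 2, w = production)

# total production per resulting level
tapply(production, lumped, sum)
# A = 17, C = 20, Other = 12 (B and D combined: 7 + 5)
```

B has the most rows but only the third-largest total, so it ends up in Other once w is supplied.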


3 answers

Answer 1
4, accepted, 2018-10-04 14:58:14

Answer 2
1, 2018-10-04 15:15:29

Answer 3
1, 2021-02-24 19:46:54
