簡體   English   中英

如何在R中對數據進行條件分組?

[英]How to do conditional grouping of data in R?

這里我有按年份和型號的銷售數據:

df <- data.frame (model  = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
 Year = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
                  sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,1180))

   model Year sales
1      A 2017   900
..................
17     B 2020    14
18     C 2020    15
19     D 2020   170
20     E 2020 1180

這里我添加份額累積份額列,並應用以下條件語句:如果模型在2020 年的累積份額 > 90%,則歸類為“不顯着”。 因此,該條件僅適用於 2020 年,然后結果將分布在整個期間。 例如,如果我在 2020 年將 E、A 模型和其他歸類為不顯着,接下來我需要將 E 和 A 模型分開並將其他模型轉換為不顯着。

    df2 <- df  %>% 
      group_by(Year) %>% mutate(Share = 100 * sales/ sum(sales),
             order = order(order(-Share))) %>% arrange(Year, order, by_group = TRUE) %>%
      mutate(CumulativeShare= cumsum(Share)) %>%ungroup() %>%
  mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90]) %>%
  mutate(model = ifelse(threshold.90, model, 'insignificant'))

           model Year sales      Share order CumulativeShare threshold.90
1              A 2017   900 43.2692308     1        43.26923         TRUE
2  insignificant 2017   456 21.9230769     2        65.19231        FALSE
3  insignificant 2017   345 16.5865385     3        81.77885        FALSE
4  insignificant 2017   235 11.2980769     4        93.07692        FALSE
5              E 2017   144  6.9230769     5       100.00000         TRUE
6  insignificant 2018   555 24.9662618     1        24.96626        FALSE
7  insignificant 2018   456 20.5128205     2        45.47908        FALSE
8  insignificant 2018   445 20.0179937     3        65.49708        FALSE
9              E 2018   434 19.5231669     4        85.02024         TRUE
10             A 2018   333 14.9797571     5       100.00000         TRUE
11             A 2019  8911 31.6452999     1        31.64530         TRUE
12             E 2019  5555 19.7272630     2        51.37256         TRUE
13 insignificant 2019  4567 16.2186157     3        67.59118        FALSE
14 insignificant 2019  4566 16.2150645     4        83.80624        FALSE
15 insignificant 2019  4560 16.1937569     5       100.00000        FALSE
16             E 2020  1180 73.6119775     1        73.61198         TRUE
17             A 2020   224 13.9737991     2        87.58578         TRUE
18 insignificant 2020   170 10.6051154     3        98.19089        FALSE
19 insignificant 2020    15  0.9357455     4        99.12664        FALSE
20 insignificant 2020    14  0.8733624     5       100.00000        FALSE

但是,如果單個模型在 2020 年的份額在邏輯上超過 90%,那么所有模型的累積份額都將超過 90%。 因此,它們都被歸類為“無關緊要”。 例如,如果我們將數據框中的最后一個值從 1180 更改為 20000,則輸出如下所示:

df <- data.frame (model  = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
 Year = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
                  sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,20000))
df2 <- df  %>% ...:


           model Year sales       Share order CumulativeShare threshold.90
1  insignificant 2017   900 43.26923077     1        43.26923        FALSE
2  insignificant 2017   456 21.92307692     2        65.19231        FALSE
3  insignificant 2017   345 16.58653846     3        81.77885        FALSE
4  insignificant 2017   235 11.29807692     4        93.07692        FALSE
5  insignificant 2017   144  6.92307692     5       100.00000        FALSE
6  insignificant 2018   555 24.96626181     1        24.96626        FALSE
7  insignificant 2018   456 20.51282051     2        45.47908        FALSE
8  insignificant 2018   445 20.01799370     3        65.49708        FALSE
9  insignificant 2018   434 19.52316689     4        85.02024        FALSE
10 insignificant 2018   333 14.97975709     5       100.00000        FALSE
11 insignificant 2019  8911 31.64529990     1        31.64530        FALSE
12 insignificant 2019  5555 19.72726304     2        51.37256        FALSE
13 insignificant 2019  4567 16.21861572     3        67.59118        FALSE
14 insignificant 2019  4566 16.21506446     4        83.80624        FALSE
15 insignificant 2019  4560 16.19375688     5       100.00000        FALSE
16 insignificant 2020 20000 97.92880576     1        97.92881        FALSE
17 insignificant 2020   224  1.09680262     2        99.02561        FALSE
18 insignificant 2020   170  0.83239485     3        99.85800        FALSE
19 insignificant 2020    15  0.07344660     4        99.93145        FALSE
20 insignificant 2020    14  0.06855016     5       100.00000        FALSE

因此,我想避免這種特定情況並添加一個條件:

如果2020年單個模型的SHARE超過90%,則應單獨保留,所有其他模型應歸類為不顯着

預期輸出:

           model Year sales       Share order CumulativeShare threshold.90
1  insignificant 2017   900 43.26923077     1        43.26923        FALSE
2  insignificant 2017   456 21.92307692     2        65.19231        FALSE
3  insignificant 2017   345 16.58653846     3        81.77885        FALSE
4  insignificant 2017   235 11.29807692     4        93.07692        FALSE
5  E             2017   144  6.92307692     5       100.00000        FALSE
6  insignificant 2018   555 24.96626181     1        24.96626        FALSE
7  insignificant 2018   456 20.51282051     2        45.47908        FALSE
8  insignificant 2018   445 20.01799370     3        65.49708        FALSE
9  E             2018   434 19.52316689     4        85.02024        FALSE
10 insignificant 2018   333 14.97975709     5       100.00000        FALSE
11 insignificant 2019  8911 31.64529990     1        31.64530        FALSE
12 E             2019  5555 19.72726304     2        51.37256        FALSE
13 insignificant 2019  4567 16.21861572     3        67.59118        FALSE
14 insignificant 2019  4566 16.21506446     4        83.80624        FALSE
15 insignificant 2019  4560 16.19375688     5       100.00000        FALSE
16 E                  20000 97.92880576     1        97.92881        FALSE
17 insignificant 2020   224  1.09680262     2        99.02561        FALSE
18 insignificant 2020   170  0.83239485     3        99.85800        FALSE
19 insignificant 2020    15  0.07344660     4        99.93145        FALSE
20 insignificant 2020    14  0.06855016     5       100.00000        FALSE

我想我會使用幾個臨時變量來幫助您在這里跟蹤。 本質上,您需要知道最后一年的第一名模型以及最后一年的累積值。 然后,任何滿足“最后一年少於 90 或最后一年首次進入”條件的模型都將被保留。

df  %>% 
  group_by(Year) %>% 
  mutate(Share = 100 * sales/ sum(sales),
        order = order(order(-Share))) %>% 
  arrange(Year, order, by_group = TRUE) %>%
  mutate(CumulativeShare= cumsum(Share)) %>%
  ungroup() %>%
  mutate(finalyear = Year == max(Year),
         finval = CumulativeShare[finalyear][match(model, model[finalyear])],
         finlast = c(FALSE, diff(finalyear) == 1),
         keep = finval <90 | finlast[finalyear][match(model, model[finalyear])],
         model = ifelse(keep, model, 'insignificant')) %>%
  select(-finalyear, -finval, -finlast, -keep)

使用您的第一個示例數據集,這看起來像

#> # A tibble: 20 x 6
#>    model          Year sales  Share order CumulativeShare
#>    <chr>         <dbl> <dbl>  <dbl> <int>           <dbl>
#>  1 A              2017   900 43.3       1            43.3
#>  2 insignificant  2017   456 21.9       2            65.2
#>  3 insignificant  2017   345 16.6       3            81.8
#>  4 insignificant  2017   235 11.3       4            93.1
#>  5 E              2017   144  6.92      5           100  
#>  6 insignificant  2018   555 25.0       1            25.0
#>  7 insignificant  2018   456 20.5       2            45.5
#>  8 insignificant  2018   445 20.0       3            65.5
#>  9 E              2018   434 19.5       4            85.0
#> 10 A              2018   333 15.0       5           100  
#> 11 A              2019  8911 31.6       1            31.6
#> 12 E              2019  5555 19.7       2            51.4
#> 13 insignificant  2019  4567 16.2       3            67.6
#> 14 insignificant  2019  4566 16.2       4            83.8
#> 15 insignificant  2019  4560 16.2       5           100  
#> 16 E              2020  1180 73.6       1            73.6
#> 17 A              2020   224 14.0       2            87.6
#> 18 insignificant  2020   170 10.6       3            98.2
#> 19 insignificant  2020    15  0.936     4            99.1
#> 20 insignificant  2020    14  0.873     5           100

使用您的第二個數據集,它看起來像這樣:

#> # A tibble: 20 x 6
#>    model          Year sales   Share order CumulativeShare
#>    <chr>         <dbl> <dbl>   <dbl> <int>           <dbl>
#>  1 insignificant  2017   900 43.3        1            43.3
#>  2 insignificant  2017   456 21.9        2            65.2
#>  3 insignificant  2017   345 16.6        3            81.8
#>  4 insignificant  2017   235 11.3        4            93.1
#>  5 E              2017   144  6.92       5           100  
#>  6 insignificant  2018   555 25.0        1            25.0
#>  7 insignificant  2018   456 20.5        2            45.5
#>  8 insignificant  2018   445 20.0        3            65.5
#>  9 E              2018   434 19.5        4            85.0
#> 10 insignificant  2018   333 15.0        5           100  
#> 11 insignificant  2019  8911 31.6        1            31.6
#> 12 E              2019  5555 19.7        2            51.4
#> 13 insignificant  2019  4567 16.2        3            67.6
#> 14 insignificant  2019  4566 16.2        4            83.8
#> 15 insignificant  2019  4560 16.2        5           100  
#> 16 E              2020 20000 97.9        1            97.9
#> 17 insignificant  2020   224  1.10       2            99.0
#> 18 insignificant  2020   170  0.832      3            99.9
#> 19 insignificant  2020    15  0.0734     4            99.9
#> 20 insignificant  2020    14  0.0686     5           100

reprex 包於 2022-07-14 創建 (v2.0.1)

這應該可以解決問題:

df  %>% 
group_by(Year) %>% 
mutate(Share = 100 * sales/ sum(sales),order = order(order(-Share))) %>% arrange(Year,order, by_group = TRUE) %>%
mutate(CumulativeShare= cumsum(Share)) %>% ungroup() %>%
#added this
mutate(test = ifelse(Year == max(Year) &  Share > 90, model, NA)) %>%
mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90] ) %>%
#and modified that
mutate(model = ifelse(threshold.90 | !is.na(test), model, 'insignificant'))

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM