简体   繁体   English

如何在R中对数据进行条件分组?

[英]How to do conditional grouping of data in R?

Here I have sales data by year and model:这里我有按年份和型号的销售数据:

df <- data.frame (model  = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
 Year = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
                  sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,1180))

   model Year sales
1      A 2017   900
..................
17     B 2020    14
18     C 2020    15
19     D 2020   170
20     E 2020 1180

Here I add the share & cumulative shares columns and apply the following conditional statement: If the model has a cumulative share > 90% in 2020 , it is categorized as "insignificant".这里我添加份额累积份额列,并应用以下条件语句:如果模型在2020 年的累积份额 > 90%,则归类为“不显着”。 So the condition is applied to only 2020 and next the result is spread over the entire period.因此,该条件仅适用于 2020 年,然后结果将分布在整个期间。 For instance, If I get E, A models and others categorised as insignificant in 2020, next I need to separate E and A models and convert others as insignificant every year.例如,如果我在 2020 年将 E、A 模型和其他归类为不显着,接下来我需要将 E 和 A 模型分开并将其他模型转换为不显着。

    df2 <- df  %>% 
      group_by(Year) %>% mutate(Share = 100 * sales/ sum(sales),
             order = order(order(-Share))) %>% arrange(Year, order, by_group = TRUE) %>%
      mutate(CumulativeShare= cumsum(Share)) %>%ungroup() %>%
  mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90]) %>%
  mutate(model = ifelse(threshold.90, model, 'insignificant'))

           model Year sales      Share order CumulativeShare threshold.90
1              A 2017   900 43.2692308     1        43.26923         TRUE
2  insignificant 2017   456 21.9230769     2        65.19231        FALSE
3  insignificant 2017   345 16.5865385     3        81.77885        FALSE
4  insignificant 2017   235 11.2980769     4        93.07692        FALSE
5              E 2017   144  6.9230769     5       100.00000         TRUE
6  insignificant 2018   555 24.9662618     1        24.96626        FALSE
7  insignificant 2018   456 20.5128205     2        45.47908        FALSE
8  insignificant 2018   445 20.0179937     3        65.49708        FALSE
9              E 2018   434 19.5231669     4        85.02024         TRUE
10             A 2018   333 14.9797571     5       100.00000         TRUE
11             A 2019  8911 31.6452999     1        31.64530         TRUE
12             E 2019  5555 19.7272630     2        51.37256         TRUE
13 insignificant 2019  4567 16.2186157     3        67.59118        FALSE
14 insignificant 2019  4566 16.2150645     4        83.80624        FALSE
15 insignificant 2019  4560 16.1937569     5       100.00000        FALSE
16             E 2020  1180 73.6119775     1        73.61198         TRUE
17             A 2020   224 13.9737991     2        87.58578         TRUE
18 insignificant 2020   170 10.6051154     3        98.19089        FALSE
19 insignificant 2020    15  0.9357455     4        99.12664        FALSE
20 insignificant 2020    14  0.8733624     5       100.00000        FALSE

However, if single model has share above 90% in 2020 logically all the model would have cumulative share above 90%.但是,如果单个模型在 2020 年的份额在逻辑上超过 90%,那么所有模型的累积份额都将超过 90%。 Hence, all of them are categorised as "insignificant".因此,它们都被归类为“无关紧要”。 For example if we change last value from 1180 to 20000 in dataframe, output well be like this:例如,如果我们将数据框中的最后一个值从 1180 更改为 20000,则输出如下所示:

df <- data.frame (model  = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
 Year = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
                  sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,20000))
df2 <- df  %>% ...:


           model Year sales       Share order CumulativeShare threshold.90
1  insignificant 2017   900 43.26923077     1        43.26923        FALSE
2  insignificant 2017   456 21.92307692     2        65.19231        FALSE
3  insignificant 2017   345 16.58653846     3        81.77885        FALSE
4  insignificant 2017   235 11.29807692     4        93.07692        FALSE
5  insignificant 2017   144  6.92307692     5       100.00000        FALSE
6  insignificant 2018   555 24.96626181     1        24.96626        FALSE
7  insignificant 2018   456 20.51282051     2        45.47908        FALSE
8  insignificant 2018   445 20.01799370     3        65.49708        FALSE
9  insignificant 2018   434 19.52316689     4        85.02024        FALSE
10 insignificant 2018   333 14.97975709     5       100.00000        FALSE
11 insignificant 2019  8911 31.64529990     1        31.64530        FALSE
12 insignificant 2019  5555 19.72726304     2        51.37256        FALSE
13 insignificant 2019  4567 16.21861572     3        67.59118        FALSE
14 insignificant 2019  4566 16.21506446     4        83.80624        FALSE
15 insignificant 2019  4560 16.19375688     5       100.00000        FALSE
16 insignificant 2020 20000 97.92880576     1        97.92881        FALSE
17 insignificant 2020   224  1.09680262     2        99.02561        FALSE
18 insignificant 2020   170  0.83239485     3        99.85800        FALSE
19 insignificant 2020    15  0.07344660     4        99.93145        FALSE
20 insignificant 2020    14  0.06855016     5       100.00000        FALSE

So, I want to avoid this specific situation and add one more condition:因此,我想避免这种特定情况并添加一个条件:

If a single model's SHARE is more than 90% in 2020, it should be left separate and all the other models should be categorised as insignificant .如果2020年单个模型的SHARE超过90%,则应单独保留,所有其他模型应归类为不显着

Expected output:预期输出:

           model Year sales       Share order CumulativeShare threshold.90
1  insignificant 2017   900 43.26923077     1        43.26923        FALSE
2  insignificant 2017   456 21.92307692     2        65.19231        FALSE
3  insignificant 2017   345 16.58653846     3        81.77885        FALSE
4  insignificant 2017   235 11.29807692     4        93.07692        FALSE
5  E             2017   144  6.92307692     5       100.00000        FALSE
6  insignificant 2018   555 24.96626181     1        24.96626        FALSE
7  insignificant 2018   456 20.51282051     2        45.47908        FALSE
8  insignificant 2018   445 20.01799370     3        65.49708        FALSE
9  E             2018   434 19.52316689     4        85.02024        FALSE
10 insignificant 2018   333 14.97975709     5       100.00000        FALSE
11 insignificant 2019  8911 31.64529990     1        31.64530        FALSE
12 E             2019  5555 19.72726304     2        51.37256        FALSE
13 insignificant 2019  4567 16.21861572     3        67.59118        FALSE
14 insignificant 2019  4566 16.21506446     4        83.80624        FALSE
15 insignificant 2019  4560 16.19375688     5       100.00000        FALSE
16 E                  20000 97.92880576     1        97.92881        FALSE
17 insignificant 2020   224  1.09680262     2        99.02561        FALSE
18 insignificant 2020   170  0.83239485     3        99.85800        FALSE
19 insignificant 2020    15  0.07344660     4        99.93145        FALSE
20 insignificant 2020    14  0.06855016     5       100.00000        FALSE

I think I would use a couple of temporary variables to help you keep track here.我想我会使用几个临时变量来帮助您在这里跟踪。 Essentially you need to know the first-placed model in the final year as well as the cumulative values of the final year.本质上,您需要知道最后一年的第一名模型以及最后一年的累积值。 Then any model that meets the conditions 'Less than 90 in the final year OR first entry in the final year' is retained.然后,任何满足“最后一年少于 90 或最后一年首次进入”条件的模型都将被保留。

df  %>% 
  group_by(Year) %>% 
  mutate(Share = 100 * sales/ sum(sales),
        order = order(order(-Share))) %>% 
  arrange(Year, order, by_group = TRUE) %>%
  mutate(CumulativeShare= cumsum(Share)) %>%
  ungroup() %>%
  mutate(finalyear = Year == max(Year),
         finval = CumulativeShare[finalyear][match(model, model[finalyear])],
         finlast = c(FALSE, diff(finalyear) == 1),
         keep = finval <90 | finlast[finalyear][match(model, model[finalyear])],
         model = ifelse(keep, model, 'insignificant')) %>%
  select(-finalyear, -finval, -finlast, -keep)

With your first example data set, this would look like使用您的第一个示例数据集,这看起来像

#> # A tibble: 20 x 6
#>    model          Year sales  Share order CumulativeShare
#>    <chr>         <dbl> <dbl>  <dbl> <int>           <dbl>
#>  1 A              2017   900 43.3       1            43.3
#>  2 insignificant  2017   456 21.9       2            65.2
#>  3 insignificant  2017   345 16.6       3            81.8
#>  4 insignificant  2017   235 11.3       4            93.1
#>  5 E              2017   144  6.92      5           100  
#>  6 insignificant  2018   555 25.0       1            25.0
#>  7 insignificant  2018   456 20.5       2            45.5
#>  8 insignificant  2018   445 20.0       3            65.5
#>  9 E              2018   434 19.5       4            85.0
#> 10 A              2018   333 15.0       5           100  
#> 11 A              2019  8911 31.6       1            31.6
#> 12 E              2019  5555 19.7       2            51.4
#> 13 insignificant  2019  4567 16.2       3            67.6
#> 14 insignificant  2019  4566 16.2       4            83.8
#> 15 insignificant  2019  4560 16.2       5           100  
#> 16 E              2020  1180 73.6       1            73.6
#> 17 A              2020   224 14.0       2            87.6
#> 18 insignificant  2020   170 10.6       3            98.2
#> 19 insignificant  2020    15  0.936     4            99.1
#> 20 insignificant  2020    14  0.873     5           100

And with your second data set, it would look like this:使用您的第二个数据集,它看起来像这样:

#> # A tibble: 20 x 6
#>    model          Year sales   Share order CumulativeShare
#>    <chr>         <dbl> <dbl>   <dbl> <int>           <dbl>
#>  1 insignificant  2017   900 43.3        1            43.3
#>  2 insignificant  2017   456 21.9        2            65.2
#>  3 insignificant  2017   345 16.6        3            81.8
#>  4 insignificant  2017   235 11.3        4            93.1
#>  5 E              2017   144  6.92       5           100  
#>  6 insignificant  2018   555 25.0        1            25.0
#>  7 insignificant  2018   456 20.5        2            45.5
#>  8 insignificant  2018   445 20.0        3            65.5
#>  9 E              2018   434 19.5        4            85.0
#> 10 insignificant  2018   333 15.0        5           100  
#> 11 insignificant  2019  8911 31.6        1            31.6
#> 12 E              2019  5555 19.7        2            51.4
#> 13 insignificant  2019  4567 16.2        3            67.6
#> 14 insignificant  2019  4566 16.2        4            83.8
#> 15 insignificant  2019  4560 16.2        5           100  
#> 16 E              2020 20000 97.9        1            97.9
#> 17 insignificant  2020   224  1.10       2            99.0
#> 18 insignificant  2020   170  0.832      3            99.9
#> 19 insignificant  2020    15  0.0734     4            99.9
#> 20 insignificant  2020    14  0.0686     5           100

Created on 2022-07-14 by the reprex package (v2.0.1)reprex 包于 2022-07-14 创建 (v2.0.1)

This should do the trick:这应该可以解决问题:

df  %>% 
group_by(Year) %>% 
mutate(Share = 100 * sales/ sum(sales),order = order(order(-Share))) %>% arrange(Year,order, by_group = TRUE) %>%
mutate(CumulativeShare= cumsum(Share)) %>% ungroup() %>%
#added this
mutate(test = ifelse(Year == max(Year) &  Share > 90, model, NA)) %>%
mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90] ) %>%
#and modified that
mutate(model = ifelse(threshold.90 | !is.na(test), model, 'insignificant'))

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM