How to do conditional grouping of data in R?
Here I have sales data by year and model:
df <- data.frame (model = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
Year = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,1180))
model Year sales
1 A 2017 900
..................
17 B 2020 14
18 C 2020 15
19 D 2020 170
20 E 2020 1180
Here I add share and cumulative share columns and apply the following condition: a model whose cumulative share in 2020 is above 90% is categorised as "insignificant". The condition is evaluated on 2020 only, and the result is then applied to the entire period. For instance, if E and A are the only models below the 90% threshold in 2020, I keep E and A as separate models and relabel all the others as "insignificant" in every year.
library(dplyr)
df2 <- df %>%
  group_by(Year) %>%
  mutate(Share = 100 * sales / sum(sales),
         order = order(order(-Share))) %>%   # rank within each year, largest share first
  arrange(Year, order, .by_group = TRUE) %>%
  mutate(CumulativeShare = cumsum(Share)) %>%
  ungroup() %>%
  # TRUE for models whose cumulative share in the latest year is below 90
  mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90]) %>%
  mutate(model = ifelse(threshold.90, model, 'insignificant'))
model Year sales Share order CumulativeShare threshold.90
1 A 2017 900 43.2692308 1 43.26923 TRUE
2 insignificant 2017 456 21.9230769 2 65.19231 FALSE
3 insignificant 2017 345 16.5865385 3 81.77885 FALSE
4 insignificant 2017 235 11.2980769 4 93.07692 FALSE
5 E 2017 144 6.9230769 5 100.00000 TRUE
6 insignificant 2018 555 24.9662618 1 24.96626 FALSE
7 insignificant 2018 456 20.5128205 2 45.47908 FALSE
8 insignificant 2018 445 20.0179937 3 65.49708 FALSE
9 E 2018 434 19.5231669 4 85.02024 TRUE
10 A 2018 333 14.9797571 5 100.00000 TRUE
11 A 2019 8911 31.6452999 1 31.64530 TRUE
12 E 2019 5555 19.7272630 2 51.37256 TRUE
13 insignificant 2019 4567 16.2186157 3 67.59118 FALSE
14 insignificant 2019 4566 16.2150645 4 83.80624 FALSE
15 insignificant 2019 4560 16.1937569 5 100.00000 FALSE
16 E 2020 1180 73.6119775 1 73.61198 TRUE
17 A 2020 224 13.9737991 2 87.58578 TRUE
18 insignificant 2020 170 10.6051154 3 98.19089 FALSE
19 insignificant 2020 15 0.9357455 4 99.12664 FALSE
20 insignificant 2020 14 0.8733624 5 100.00000 FALSE
However, if a single model has a share above 90% in 2020, then logically every model has a cumulative share above 90%, so all of them are categorised as "insignificant". For example, if we change the last value in the data frame from 1180 to 20000, the output will look like this:
df <- data.frame (model = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
Year = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,20000))
df2 <- df %>% ...:
model Year sales Share order CumulativeShare threshold.90
1 insignificant 2017 900 43.26923077 1 43.26923 FALSE
2 insignificant 2017 456 21.92307692 2 65.19231 FALSE
3 insignificant 2017 345 16.58653846 3 81.77885 FALSE
4 insignificant 2017 235 11.29807692 4 93.07692 FALSE
5 insignificant 2017 144 6.92307692 5 100.00000 FALSE
6 insignificant 2018 555 24.96626181 1 24.96626 FALSE
7 insignificant 2018 456 20.51282051 2 45.47908 FALSE
8 insignificant 2018 445 20.01799370 3 65.49708 FALSE
9 insignificant 2018 434 19.52316689 4 85.02024 FALSE
10 insignificant 2018 333 14.97975709 5 100.00000 FALSE
11 insignificant 2019 8911 31.64529990 1 31.64530 FALSE
12 insignificant 2019 5555 19.72726304 2 51.37256 FALSE
13 insignificant 2019 4567 16.21861572 3 67.59118 FALSE
14 insignificant 2019 4566 16.21506446 4 83.80624 FALSE
15 insignificant 2019 4560 16.19375688 5 100.00000 FALSE
16 insignificant 2020 20000 97.92880576 1 97.92881 FALSE
17 insignificant 2020 224 1.09680262 2 99.02561 FALSE
18 insignificant 2020 170 0.83239485 3 99.85800 FALSE
19 insignificant 2020 15 0.07344660 4 99.93145 FALSE
20 insignificant 2020 14 0.06855016 5 100.00000 FALSE
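To make the cause explicit, here is a minimal base R sketch that recomputes the 2020 cumulative shares for this second data set. The variable names `sales`, `share` and `cum_share` are only illustrative and not part of the pipeline above.

```r
# 2020 sales from the second data set
sales <- c(A = 224, B = 14, C = 15, D = 170, E = 20000)

# Share per model, largest first, then the running total
share <- sort(100 * sales / sum(sales), decreasing = TRUE)
cum_share <- cumsum(share)

# Models below the 90% threshold: none, because E alone already exceeds it
below_90 <- names(cum_share)[cum_share < 90]
below_90
```

Because `below_90` is empty, `model %in% model[...]` is FALSE for every row, which is why every model is relabelled.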
So I want to avoid this specific situation by adding one more condition:
If a single model's share is more than 90% in 2020, it should be kept separate and all the other models should be categorised as "insignificant".
Expected output:
model Year sales Share order CumulativeShare threshold.90
1 insignificant 2017 900 43.26923077 1 43.26923 FALSE
2 insignificant 2017 456 21.92307692 2 65.19231 FALSE
3 insignificant 2017 345 16.58653846 3 81.77885 FALSE
4 insignificant 2017 235 11.29807692 4 93.07692 FALSE
5 E 2017 144 6.92307692 5 100.00000 FALSE
6 insignificant 2018 555 24.96626181 1 24.96626 FALSE
7 insignificant 2018 456 20.51282051 2 45.47908 FALSE
8 insignificant 2018 445 20.01799370 3 65.49708 FALSE
9 E 2018 434 19.52316689 4 85.02024 FALSE
10 insignificant 2018 333 14.97975709 5 100.00000 FALSE
11 insignificant 2019 8911 31.64529990 1 31.64530 FALSE
12 E 2019 5555 19.72726304 2 51.37256 FALSE
13 insignificant 2019 4567 16.21861572 3 67.59118 FALSE
14 insignificant 2019 4566 16.21506446 4 83.80624 FALSE
15 insignificant 2019 4560 16.19375688 5 100.00000 FALSE
16 E 2020 20000 97.92880576 1 97.92881 FALSE
17 insignificant 2020 224 1.09680262 2 99.02561 FALSE
18 insignificant 2020 170 0.83239485 3 99.85800 FALSE
19 insignificant 2020 15 0.07344660 4 99.93145 FALSE
20 insignificant 2020 14 0.06855016 5 100.00000 FALSE
I think I would use a couple of temporary variables to keep track here. Essentially you need to know the first-placed model in the final year as well as the cumulative shares in the final year. Then any model that meets the condition "cumulative share below 90 in the final year OR ranked first in the final year" is retained.
df %>%
group_by(Year) %>%
mutate(Share = 100 * sales/ sum(sales),
order = order(order(-Share))) %>%
arrange(Year, order, .by_group = TRUE) %>%
mutate(CumulativeShare= cumsum(Share)) %>%
ungroup() %>%
mutate(finalyear = Year == max(Year),
finval = CumulativeShare[finalyear][match(model, model[finalyear])],
finlast = c(FALSE, diff(finalyear) == 1),
keep = finval <90 | finlast[finalyear][match(model, model[finalyear])],
model = ifelse(keep, model, 'insignificant')) %>%
select(-finalyear, -finval, -finlast, -keep)
With your first example data set, this would look like:
#> # A tibble: 20 x 6
#> model Year sales Share order CumulativeShare
#> <chr> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 A 2017 900 43.3 1 43.3
#> 2 insignificant 2017 456 21.9 2 65.2
#> 3 insignificant 2017 345 16.6 3 81.8
#> 4 insignificant 2017 235 11.3 4 93.1
#> 5 E 2017 144 6.92 5 100
#> 6 insignificant 2018 555 25.0 1 25.0
#> 7 insignificant 2018 456 20.5 2 45.5
#> 8 insignificant 2018 445 20.0 3 65.5
#> 9 E 2018 434 19.5 4 85.0
#> 10 A 2018 333 15.0 5 100
#> 11 A 2019 8911 31.6 1 31.6
#> 12 E 2019 5555 19.7 2 51.4
#> 13 insignificant 2019 4567 16.2 3 67.6
#> 14 insignificant 2019 4566 16.2 4 83.8
#> 15 insignificant 2019 4560 16.2 5 100
#> 16 E 2020 1180 73.6 1 73.6
#> 17 A 2020 224 14.0 2 87.6
#> 18 insignificant 2020 170 10.6 3 98.2
#> 19 insignificant 2020 15 0.936 4 99.1
#> 20 insignificant 2020 14 0.873 5 100
And with your second data set, it would look like this:
#> # A tibble: 20 x 6
#> model Year sales Share order CumulativeShare
#> <chr> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 insignificant 2017 900 43.3 1 43.3
#> 2 insignificant 2017 456 21.9 2 65.2
#> 3 insignificant 2017 345 16.6 3 81.8
#> 4 insignificant 2017 235 11.3 4 93.1
#> 5 E 2017 144 6.92 5 100
#> 6 insignificant 2018 555 25.0 1 25.0
#> 7 insignificant 2018 456 20.5 2 45.5
#> 8 insignificant 2018 445 20.0 3 65.5
#> 9 E 2018 434 19.5 4 85.0
#> 10 insignificant 2018 333 15.0 5 100
#> 11 insignificant 2019 8911 31.6 1 31.6
#> 12 E 2019 5555 19.7 2 51.4
#> 13 insignificant 2019 4567 16.2 3 67.6
#> 14 insignificant 2019 4566 16.2 4 83.8
#> 15 insignificant 2019 4560 16.2 5 100
#> 16 E 2020 20000 97.9 1 97.9
#> 17 insignificant 2020 224 1.10 2 99.0
#> 18 insignificant 2020 170 0.832 3 99.9
#> 19 insignificant 2020 15 0.0734 4 99.9
#> 20 insignificant 2020 14 0.0686 5 100
Created on 2022-07-14 by the reprex package (v2.0.1)
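The same condition can also be sketched by first working out which models to keep and relabelling afterwards. This is an alternative formulation under the same assumptions, not the code from the answer above; `models_to_keep` and `cutoff` are hypothetical names introduced here, and ties in sales are not handled specially.

```r
library(dplyr)

# Second example data set (last value changed to 20000)
df <- data.frame(
  model = rep(c("A", "B", "C", "D", "E"), times = 4),
  Year  = rep(2017:2020, each = 5),
  sales = c(900, 235, 456, 345, 144, 333, 555, 445, 456, 434,
            8911, 4560, 4567, 4566, 5555, 224, 14, 15, 170, 20000),
  stringsAsFactors = FALSE
)

# Hypothetical helper: decide on the latest year only
models_to_keep <- function(data, cutoff = 90) {
  last <- data %>%
    filter(Year == max(Year)) %>%
    arrange(desc(sales)) %>%
    mutate(CumulativeShare = 100 * cumsum(sales) / sum(sales))
  # Keep models below the cutoff, plus the top-ranked model in any case
  last$model[last$CumulativeShare < cutoff | seq_len(nrow(last)) == 1]
}

keep <- models_to_keep(df)

# Apply the decision to every year
df2 <- df %>%
  mutate(model = ifelse(model %in% keep, model, "insignificant"))
```

Separating the decision from the relabelling means the set of kept models can be inspected on its own before it is applied to all years.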
This should do the trick:
df %>%
  group_by(Year) %>%
  mutate(Share = 100 * sales / sum(sales),
         order = order(order(-Share))) %>%
  arrange(Year, order, .by_group = TRUE) %>%
  mutate(CumulativeShare = cumsum(Share)) %>%
  ungroup() %>%
  # added this: flag, in every year, the model whose own share exceeds 90% in the latest year
  mutate(test = model %in% model[Year == max(Year) & Share > 90]) %>%
  mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90]) %>%
  # and modified that: keep a model if either condition holds
  mutate(model = ifelse(threshold.90 | test, model, 'insignificant'))
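One small difference between the two answers is worth noting: the first keeps the top-ranked model of the final year unconditionally, while the second keeps it only when its share is strictly greater than 90. A minimal base R sketch of that edge case follows, using made-up toy numbers (`X`, `Y`, `Z` are not from the question) where the top model sits at exactly 90%.

```r
# Toy final-year sales, invented for illustration: X has exactly 90%
sales <- c(X = 90, Y = 5, Z = 5)
share <- sort(100 * sales / sum(sales), decreasing = TRUE)
cum_share <- cumsum(share)

# Rule based on the model's own share being above 90
keep_share_rule <- names(share)[cum_share < 90 | share > 90]

# Rule based on being ranked first in the final year
keep_rank_rule <- names(share)[cum_share < 90 | seq_along(share) == 1]

keep_share_rule
keep_rank_rule
```

With these numbers the share-based rule keeps nothing while the rank-based rule keeps `X`, so which one fits depends on how a share of exactly 90% should be treated.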