How to do conditional grouping of data in R?
Here I have sales data by year and model:
df <- data.frame (model = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
Year = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,1180))
model Year sales
1 A 2017 900
..................
17 B 2020 14
18 C 2020 15
19 D 2020 170
20 E 2020 1180
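Here I add share and cumulative share columns and apply the following condition: if a model's cumulative share in 2020 is above 90%, classify it as "insignificant". The condition is evaluated only on 2020, and the result is then applied to the whole period. For example, if in 2020 models E and A stay below the threshold and the others are classified as insignificant, I then need to keep E and A as separate models in every year and relabel all other models as insignificant.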
library(dplyr)

df2 <- df %>%
  group_by(Year) %>%
  mutate(Share = 100 * sales / sum(sales),
         order = order(order(-Share))) %>%
  arrange(Year, order, .by_group = TRUE) %>%
  mutate(CumulativeShare = cumsum(Share)) %>%
  ungroup() %>%
  mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90]) %>%
  mutate(model = ifelse(threshold.90, model, 'insignificant'))
model Year sales Share order CumulativeShare threshold.90
1 A 2017 900 43.2692308 1 43.26923 TRUE
2 insignificant 2017 456 21.9230769 2 65.19231 FALSE
3 insignificant 2017 345 16.5865385 3 81.77885 FALSE
4 insignificant 2017 235 11.2980769 4 93.07692 FALSE
5 E 2017 144 6.9230769 5 100.00000 TRUE
6 insignificant 2018 555 24.9662618 1 24.96626 FALSE
7 insignificant 2018 456 20.5128205 2 45.47908 FALSE
8 insignificant 2018 445 20.0179937 3 65.49708 FALSE
9 E 2018 434 19.5231669 4 85.02024 TRUE
10 A 2018 333 14.9797571 5 100.00000 TRUE
11 A 2019 8911 31.6452999 1 31.64530 TRUE
12 E 2019 5555 19.7272630 2 51.37256 TRUE
13 insignificant 2019 4567 16.2186157 3 67.59118 FALSE
14 insignificant 2019 4566 16.2150645 4 83.80624 FALSE
15 insignificant 2019 4560 16.1937569 5 100.00000 FALSE
16 E 2020 1180 73.6119775 1 73.61198 TRUE
17 A 2020 224 13.9737991 2 87.58578 TRUE
18 insignificant 2020 170 10.6051154 3 98.19089 FALSE
19 insignificant 2020 15 0.9357455 4 99.12664 FALSE
20 insignificant 2020 14 0.8733624 5 100.00000 FALSE
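However, if a single model's share in 2020 exceeds 90%, then the cumulative share of every model, including that one, is above 90%. As a result, all models are classified as "insignificant". For example, if we change the last value in the data frame from 1180 to 20000, the output looks like this: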
df <- data.frame (model = c("A","B","C","D","E","A","B","C","D","E","A","B","C","D","E","A","B","C","D","E"),
Year = c(2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2020,2020,2020,2020,2020),
sales = c(900,235,456,345,144,333,555,445,456,434,8911,4560,4567,4566,5555,224,14,15,170,20000))
df2 <- df %>% ...:
model Year sales Share order CumulativeShare threshold.90
1 insignificant 2017 900 43.26923077 1 43.26923 FALSE
2 insignificant 2017 456 21.92307692 2 65.19231 FALSE
3 insignificant 2017 345 16.58653846 3 81.77885 FALSE
4 insignificant 2017 235 11.29807692 4 93.07692 FALSE
5 insignificant 2017 144 6.92307692 5 100.00000 FALSE
6 insignificant 2018 555 24.96626181 1 24.96626 FALSE
7 insignificant 2018 456 20.51282051 2 45.47908 FALSE
8 insignificant 2018 445 20.01799370 3 65.49708 FALSE
9 insignificant 2018 434 19.52316689 4 85.02024 FALSE
10 insignificant 2018 333 14.97975709 5 100.00000 FALSE
11 insignificant 2019 8911 31.64529990 1 31.64530 FALSE
12 insignificant 2019 5555 19.72726304 2 51.37256 FALSE
13 insignificant 2019 4567 16.21861572 3 67.59118 FALSE
14 insignificant 2019 4566 16.21506446 4 83.80624 FALSE
15 insignificant 2019 4560 16.19375688 5 100.00000 FALSE
16 insignificant 2020 20000 97.92880576 1 97.92881 FALSE
17 insignificant 2020 224 1.09680262 2 99.02561 FALSE
18 insignificant 2020 170 0.83239485 3 99.85800 FALSE
19 insignificant 2020 15 0.07344660 4 99.93145 FALSE
20 insignificant 2020 14 0.06855016 5 100.00000 FALSE
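The collapse can be checked directly on the 2020 rows. The following is a minimal base R sketch; the names `y2020`, `share`, `cum_share`, and `kept` are illustrative and do not appear in the original post. It shows that once the top model alone exceeds 90%, no row passes the cumulative-share filter, so the set of kept models is empty.

```r
# 2020 rows of the modified data set (last sales value changed to 20000)
y2020 <- data.frame(model = c("A", "B", "C", "D", "E"),
                    sales = c(224, 14, 15, 170, 20000),
                    stringsAsFactors = FALSE)

# Rank by descending sales, then compute share and cumulative share in percent
y2020 <- y2020[order(-y2020$sales), ]
y2020$share <- 100 * y2020$sales / sum(y2020$sales)
y2020$cum_share <- cumsum(y2020$share)

# Models that pass the original filter: none, because the first cumulative
# value is already E's own share (about 97.9%)
kept <- y2020$model[y2020$cum_share < 90]
print(kept)
```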
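So I would like to avoid this specific case and add a condition:
If a single model's share in 2020 exceeds 90%, it should be kept as a separate model, and all other models should be classified as insignificant.
Expected output: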
model Year sales Share order CumulativeShare threshold.90
1 insignificant 2017 900 43.26923077 1 43.26923 FALSE
2 insignificant 2017 456 21.92307692 2 65.19231 FALSE
3 insignificant 2017 345 16.58653846 3 81.77885 FALSE
4 insignificant 2017 235 11.29807692 4 93.07692 FALSE
5 E 2017 144 6.92307692 5 100.00000 FALSE
6 insignificant 2018 555 24.96626181 1 24.96626 FALSE
7 insignificant 2018 456 20.51282051 2 45.47908 FALSE
8 insignificant 2018 445 20.01799370 3 65.49708 FALSE
9 E 2018 434 19.52316689 4 85.02024 FALSE
10 insignificant 2018 333 14.97975709 5 100.00000 FALSE
11 insignificant 2019 8911 31.64529990 1 31.64530 FALSE
12 E 2019 5555 19.72726304 2 51.37256 FALSE
13 insignificant 2019 4567 16.21861572 3 67.59118 FALSE
14 insignificant 2019 4566 16.21506446 4 83.80624 FALSE
15 insignificant 2019 4560 16.19375688 5 100.00000 FALSE
16 E 2020 20000 97.92880576 1 97.92881 FALSE
17 insignificant 2020 224 1.09680262 2 99.02561 FALSE
18 insignificant 2020 170 0.83239485 3 99.85800 FALSE
19 insignificant 2020 15 0.07344660 4 99.93145 FALSE
20 insignificant 2020 14 0.06855016 5 100.00000 FALSE
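I think I would use a few temporary variables to help keep track here. Essentially, you need to know which model ranks first in the final year, as well as each model's cumulative share in the final year. Then any model that meets the condition "cumulative share below 90 in the final year, or ranked first in the final year" is kept.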
df %>%
  group_by(Year) %>%
  mutate(Share = 100 * sales / sum(sales),
         order = order(order(-Share))) %>%
  arrange(Year, order, .by_group = TRUE) %>%
  mutate(CumulativeShare = cumsum(Share)) %>%
  ungroup() %>%
  mutate(finalyear = Year == max(Year),
         # cumulative share of each model in the final year
         finval = CumulativeShare[finalyear][match(model, model[finalyear])],
         # TRUE only on the first row of the final year (rows are sorted by Year, then rank)
         finlast = c(FALSE, diff(finalyear) == 1),
         keep = finval < 90 | finlast[finalyear][match(model, model[finalyear])],
         model = ifelse(keep, model, 'insignificant')) %>%
  select(-finalyear, -finval, -finlast, -keep)
With your first example data set, this gives:
#> # A tibble: 20 x 6
#> model Year sales Share order CumulativeShare
#> <chr> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 A 2017 900 43.3 1 43.3
#> 2 insignificant 2017 456 21.9 2 65.2
#> 3 insignificant 2017 345 16.6 3 81.8
#> 4 insignificant 2017 235 11.3 4 93.1
#> 5 E 2017 144 6.92 5 100
#> 6 insignificant 2018 555 25.0 1 25.0
#> 7 insignificant 2018 456 20.5 2 45.5
#> 8 insignificant 2018 445 20.0 3 65.5
#> 9 E 2018 434 19.5 4 85.0
#> 10 A 2018 333 15.0 5 100
#> 11 A 2019 8911 31.6 1 31.6
#> 12 E 2019 5555 19.7 2 51.4
#> 13 insignificant 2019 4567 16.2 3 67.6
#> 14 insignificant 2019 4566 16.2 4 83.8
#> 15 insignificant 2019 4560 16.2 5 100
#> 16 E 2020 1180 73.6 1 73.6
#> 17 A 2020 224 14.0 2 87.6
#> 18 insignificant 2020 170 10.6 3 98.2
#> 19 insignificant 2020 15 0.936 4 99.1
#> 20 insignificant 2020 14 0.873 5 100
With your second data set, it gives:
#> # A tibble: 20 x 6
#> model Year sales Share order CumulativeShare
#> <chr> <dbl> <dbl> <dbl> <int> <dbl>
#> 1 insignificant 2017 900 43.3 1 43.3
#> 2 insignificant 2017 456 21.9 2 65.2
#> 3 insignificant 2017 345 16.6 3 81.8
#> 4 insignificant 2017 235 11.3 4 93.1
#> 5 E 2017 144 6.92 5 100
#> 6 insignificant 2018 555 25.0 1 25.0
#> 7 insignificant 2018 456 20.5 2 45.5
#> 8 insignificant 2018 445 20.0 3 65.5
#> 9 E 2018 434 19.5 4 85.0
#> 10 insignificant 2018 333 15.0 5 100
#> 11 insignificant 2019 8911 31.6 1 31.6
#> 12 E 2019 5555 19.7 2 51.4
#> 13 insignificant 2019 4567 16.2 3 67.6
#> 14 insignificant 2019 4566 16.2 4 83.8
#> 15 insignificant 2019 4560 16.2 5 100
#> 16 E 2020 20000 97.9 1 97.9
#> 17 insignificant 2020 224 1.10 2 99.0
#> 18 insignificant 2020 170 0.832 3 99.9
#> 19 insignificant 2020 15 0.0734 4 99.9
#> 20 insignificant 2020 14 0.0686 5 100
Created on 2022-07-14 by the reprex package (v2.0.1)
This should do the trick:
df %>%
  group_by(Year) %>%
  mutate(Share = 100 * sales / sum(sales),
         order = order(order(-Share))) %>%
  arrange(Year, order, .by_group = TRUE) %>%
  mutate(CumulativeShare = cumsum(Share)) %>%
  ungroup() %>%
  # added this: flag, in every year, the model whose own share exceeds 90 in the final year
  mutate(test = model %in% model[Year == max(Year) & Share > 90]) %>%
  mutate(threshold.90 = model %in% model[Year == max(Year) & CumulativeShare < 90]) %>%
  # and modified that
  mutate(model = ifelse(threshold.90 | test, model, 'insignificant'))
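For comparison, the same rule can be written as a small base R helper that decides which models to keep from the final year alone and then relabels every year. This is only a sketch under the assumptions stated in the question; `keep_models` and its `threshold` argument are hypothetical names, not part of either answer.

```r
# Hypothetical helper (not from the original answers): returns the models to
# keep, judged on the latest year only.
keep_models <- function(df, threshold = 90) {
  last <- df[df$Year == max(df$Year), ]
  last <- last[order(-last$sales), ]        # rank by descending sales
  share <- 100 * last$sales / sum(last$sales)
  below <- cumsum(share) < threshold        # original cumulative-share rule
  dominant <- share > threshold             # added single-model rule
  last$model[below | dominant]
}

# Second data set from the question (last sales value is 20000)
df <- data.frame(
  model = rep(c("A", "B", "C", "D", "E"), times = 4),
  Year  = rep(2017:2020, each = 5),
  sales = c(900, 235, 456, 345, 144, 333, 555, 445, 456, 434,
            8911, 4560, 4567, 4566, 5555, 224, 14, 15, 170, 20000),
  stringsAsFactors = FALSE
)

# Relabel every model that is not kept, in all years
keep <- keep_models(df)
df$model <- ifelse(df$model %in% keep, df$model, "insignificant")
```

Here only E is kept, in all four years. With the first data set (last value 1180) the helper keeps E and A, because their cumulative shares in 2020 stay below 90.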