Calculate cumulative sum in a group_by() on two different sets of columns in dplyr
My initial dataframe looks like:
library(tidyverse)
df_input <- data.frame(
cohort = c("2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
"2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
"2019-03-01", "2019-04-01", "2019-04-01", "2019-04-01",
"2019-04-01", "2019-04-01", "2019-04-01", "2019-04-01"),
months = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7),
CLV = c(59.9, 61.1, 62.06, 62.58, 62.83, NA, NA, NA, NA, 22.2, 38.24,
46.08, 56.28, NA, NA, NA),
CLV_for = c(1.66, 1.42, 1.42, 1.42, 1.18, 1.18, 1.18, 1.18, 0.95, 35.75,
26.1, 16.09, 10.37, 7.15, 6.08, 5.01)
)
cohort months CLV CLV_for
1 2019-03-01 1 59.90 1.66
2 2019-03-01 2 61.10 1.42
3 2019-03-01 3 62.06 1.42
4 2019-03-01 4 62.58 1.42
5 2019-03-01 5 62.83 1.18
6 2019-03-01 6 NA 1.18
7 2019-03-01 7 NA 1.18
8 2019-03-01 8 NA 1.18
9 2019-03-01 9 NA 0.95
10 2019-04-01 1 22.20 35.75
11 2019-04-01 2 38.24 26.10
12 2019-04-01 3 46.08 16.09
13 2019-04-01 4 56.28 10.37
14 2019-04-01 5 NA 7.15
15 2019-04-01 6 NA 6.08
16 2019-04-01 7 NA 5.01
I want to perform a cumulative sum (using cumsum() in dplyr) starting from the last non-NA value in each group (aka cohort) in column CLV and continuing with the remaining corresponding values in column CLV_for.
To better explain the calculation, I split it into 2 steps.
1) Starting from the last non-NA value in the CLV column for cohort 2019-03-01, cumsum() the corresponding values in column CLV_for. Same for the cohort 2019-04-01.
df_inter <- data.frame(
cohort = c("2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
"2019-03-01", "2019-03-01", "2019-03-01", "2019-03-01",
"2019-03-01", "2019-04-01", "2019-04-01", "2019-04-01",
"2019-04-01", "2019-04-01", "2019-04-01", "2019-04-01"),
months = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7),
CLV = c(59.9, 61.1, 62.06, 62.58, 62.83, NA, NA, NA, NA, 22.2, 38.24,
46.08, 56.28, NA, NA, NA),
cum_CLV_for = c(NA, NA, NA, NA, NA, 64.01, 65.19, 66.37, 67.32, NA,
NA, NA, NA, 63.43, 69.51, 74.52)
)
cohort months CLV cum_CLV_for
1 2019-03-01 1 59.90 NA
2 2019-03-01 2 61.10 NA
3 2019-03-01 3 62.06 NA
4 2019-03-01 4 62.58 NA
5 2019-03-01 5 62.83 NA
6 2019-03-01 6 NA 64.01 (<- 62.83 + 1.18)
7 2019-03-01 7 NA 65.19 (<- 64.01 + 1.18)
8 2019-03-01 8 NA 66.37 (<- 65.19 + 1.18)
9 2019-03-01 9 NA 67.32 (<- 66.37 + 0.95)
10 2019-04-01 1 22.20 NA
11 2019-04-01 2 38.24 NA
12 2019-04-01 3 46.08 NA
13 2019-04-01 4 56.28 NA
14 2019-04-01 5 NA 63.43 (<- 56.28 + 7.15)
15 2019-04-01 6 NA 69.51 (<- 63.43 + 6.08)
16 2019-04-01 7 NA 74.52 (<- 69.51 + 5.01)
2) The second step is to clean up by merging the two columns into one.
df_final <- data.frame(
sub_date = c("2019-03-01", "2019-03-01", "2019-03-01",
"2019-03-01", "2019-03-01",
"2019-03-01", "2019-03-01",
"2019-03-01", "2019-03-01",
"2019-04-01", "2019-04-01",
"2019-04-01", "2019-04-01",
"2019-04-01", "2019-04-01",
"2019-04-01"),
months_after_acquisition = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3, 4, 5, 6, 7),
cum_CLV = c(59.9, 61.1, 62.06, 62.58, 62.83, 64.01, 65.19,
66.37, 67.32, 22.2, 38.24,
46.08, 56.28, 63.43, 69.51,
74.52)
)
sub_date months_after_acquisition cum_CLV
1 2019-03-01 1 59.90
2 2019-03-01 2 61.10
3 2019-03-01 3 62.06
4 2019-03-01 4 62.58
5 2019-03-01 5 62.83
6 2019-03-01 6 64.01
7 2019-03-01 7 65.19
8 2019-03-01 8 66.37
9 2019-03-01 9 67.32
10 2019-04-01 1 22.20
11 2019-04-01 2 38.24
12 2019-04-01 3 46.08
13 2019-04-01 4 56.28
14 2019-04-01 5 63.43
15 2019-04-01 6 69.51
16 2019-04-01 7 74.52
Thanks for your help!
By taking either CLV or the filled-down value of CLV (the last observed value carried into the NA rows), combined with cumsum, we get what you want:
df_input %>%
  group_by(cohort) %>%
  arrange(months, .by_group = TRUE) %>%
  mutate(cum_CLV = CLV) %>%
  fill(cum_CLV) %>%  # carry the last observed CLV down into the NA rows
  mutate(cum_CLV = cum_CLV + cumsum(CLV_for * is.na(CLV)))  # add CLV_for only where CLV is NA
# cohort months CLV CLV_for cum_CLV
# <fct> <dbl> <dbl> <dbl> <dbl>
# 1 2019-03-01 1 59.9 1.66 59.9
# 2 2019-03-01 2 61.1 1.42 61.1
# 3 2019-03-01 3 62.1 1.42 62.1
# 4 2019-03-01 4 62.6 1.42 62.6
# 5 2019-03-01 5 62.8 1.18 62.8
# 6 2019-03-01 6 NA 1.18 64.0
# 7 2019-03-01 7 NA 1.18 65.2
# 8 2019-03-01 8 NA 1.18 66.4
# 9 2019-03-01 9 NA 0.95 67.3
# 10 2019-04-01 1 22.2 35.8 22.2
# 11 2019-04-01 2 38.2 26.1 38.2
# 12 2019-04-01 3 46.1 16.1 46.1
# 13 2019-04-01 4 56.3 10.4 56.3
# 14 2019-04-01 5 NA 7.15 63.4
# 15 2019-04-01 6 NA 6.08 69.5
# 16 2019-04-01 7 NA 5.01 74.5
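To see why the mask works, here is a minimal base R sketch on the 2019-04-01 cohort alone. The variable names (clv, clv_for, last_seen, filled, increments) are made up for this illustration, and the manual forward fill is only a stand-in for tidyr::fill(). Multiplying CLV_for by is.na(CLV) zeroes the forecast wherever an observed CLV exists, so the running sum only starts growing at the first NA.

```r
# Values for cohort 2019-04-01, copied from df_input
clv     <- c(22.2, 38.24, 46.08, 56.28, NA, NA, NA)
clv_for <- c(35.75, 26.1, 16.09, 10.37, 7.15, 6.08, 5.01)

# Forward fill: for each position, the index of the most recent non-NA value
# (stand-in for tidyr::fill(); assumes the first value is not NA)
last_seen <- cummax(seq_along(clv) * !is.na(clv))
filled    <- clv[last_seen]

# Forecast increments count only where CLV is missing
increments <- clv_for * is.na(clv)

# Observed values stay unchanged; NA rows get last observed CLV + running forecast
cum_clv <- filled + cumsum(increments)
print(round(cum_clv, 2))
```

The first four entries come back unchanged because their increments are zero; the last three are 56.28 plus the running total of 7.15, 6.08 and 5.01.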
Here's an approach with case_when:
library(dplyr)
df_input %>%
group_by(cohort) %>%
mutate(CumCLV = cumsum(case_when(is.na(CLV) ~ CLV_for,
TRUE ~ 0)),
CLV = case_when(is.na(CLV) ~ CumCLV + max(CLV, na.rm = TRUE),
TRUE ~ CLV)) %>%
dplyr::select(-CLV_for, -CumCLV)
# A tibble: 16 x 3
# Groups: cohort [2]
cohort months CLV
<fct> <dbl> <dbl>
1 2019-03-01 1 59.9
2 2019-03-01 2 61.1
3 2019-03-01 3 62.1
4 2019-03-01 4 62.6
5 2019-03-01 5 62.8
6 2019-03-01 6 64.0
7 2019-03-01 7 65.2
8 2019-03-01 8 66.4
9 2019-03-01 9 67.3
10 2019-04-01 1 22.2
11 2019-04-01 2 38.2
12 2019-04-01 3 46.1
13 2019-04-01 4 56.3
14 2019-04-01 5 63.4
15 2019-04-01 6 69.5
16 2019-04-01 7 74.5
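One caveat: max(CLV, na.rm = TRUE) matches the "last non-NA value" asked for in the question only when CLV never decreases within a cohort, which holds for the sample data. A small base R sketch with made-up numbers (the vector clv below is hypothetical, not taken from df_input) shows where the two diverge:

```r
# Hypothetical CLV series that dips before the NAs start
clv <- c(10, 12, 11, NA, NA)

max_value   <- max(clv, na.rm = TRUE)     # largest observed value
last_non_na <- tail(clv[!is.na(clv)], 1)  # most recent observed value

print(c(max = max_value, last = last_non_na))
```

If CLV can decrease, replacing max(CLV, na.rm = TRUE) with last(na.omit(CLV)) should keep the starting point at the last observed value.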
A data.table approach, for completeness' sake:
library(data.table)
setDT(df_input)  # converts df_input to a data.table by reference
df_input[, max := max(CLV, na.rm = TRUE), by = cohort]
df_input[ is.na(CLV), CLV := max + cumsum(CLV_for), by = cohort ][, c("max", "CLV_for") := NULL][]
# cohort months CLV
# 1: 2019-03-01 1 59.90
# 2: 2019-03-01 2 61.10
# 3: 2019-03-01 3 62.06
# 4: 2019-03-01 4 62.58
# 5: 2019-03-01 5 62.83
# 6: 2019-03-01 6 64.01
# 7: 2019-03-01 7 65.19
# 8: 2019-03-01 8 66.37
# 9: 2019-03-01 9 67.32
# 10: 2019-04-01 1 22.20
# 11: 2019-04-01 2 38.24
# 12: 2019-04-01 3 46.08
# 13: 2019-04-01 4 56.28
# 14: 2019-04-01 5 63.43
# 15: 2019-04-01 6 69.51
# 16: 2019-04-01 7 74.52
Yet another dplyr possibility could be:
df_input %>%
group_by(cohort) %>%
transmute(months,
CLV = if_else(is.na(CLV),
last(na.omit(CLV)) + cumsum(CLV_for * is.na(CLV)),
CLV))
cohort months CLV
<fct> <dbl> <dbl>
1 2019-03-01 1 59.9
2 2019-03-01 2 61.1
3 2019-03-01 3 62.1
4 2019-03-01 4 62.6
5 2019-03-01 5 62.8
6 2019-03-01 6 64.0
7 2019-03-01 7 65.2
8 2019-03-01 8 66.4
9 2019-03-01 9 67.3
10 2019-04-01 1 22.2
11 2019-04-01 2 38.2
12 2019-04-01 3 46.1
13 2019-04-01 4 56.3
14 2019-04-01 5 63.4
15 2019-04-01 6 69.5
16 2019-04-01 7 74.5
Using purrr::accumulate2():
library(purrr)
library(dplyr)
df_input %>%
group_by(cohort) %>%
mutate(CLV = flatten_dbl(accumulate2(CLV, CLV_for[-1], .f = ~ if(!is.na(..2)) ..2 else ..1 + ..3))) %>%
select(-CLV_for)
# A tibble: 16 x 3
# Groups: cohort [2]
cohort months CLV
<chr> <dbl> <dbl>
1 2019-03-01 1 59.9
2 2019-03-01 2 61.1
3 2019-03-01 3 62.1
4 2019-03-01 4 62.6
5 2019-03-01 5 62.8
6 2019-03-01 6 64.0
7 2019-03-01 7 65.2
8 2019-03-01 8 66.4
9 2019-03-01 9 67.3
10 2019-04-01 1 22.2
11 2019-04-01 2 38.2
12 2019-04-01 3 46.1
13 2019-04-01 4 56.3
14 2019-04-01 5 63.4
15 2019-04-01 6 69.5
16 2019-04-01 7 74.5
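For readers without purrr, the same left-to-right recurrence can be sketched in base R with Reduce(). This is a swapped-in illustration on a shortened, single-cohort vector, not the answer's code, and the names clv, clv_for, step and filled are made up here. The rule is the same: keep the observed CLV when there is one, otherwise add the current CLV_for to the previous result.

```r
# Months 4 to 7 of cohort 2019-03-01, copied from df_input
clv     <- c(62.58, 62.83, NA, NA)
clv_for <- c(1.42, 1.18, 1.18, 1.18)

# acc is the previous result, i is the current row index
step <- function(acc, i) {
  if (!is.na(clv[i])) clv[i] else acc + clv_for[i]
}

# Start from the first CLV and fold over the remaining rows,
# keeping every intermediate value
filled <- Reduce(step, seq_along(clv)[-1], clv[1], accumulate = TRUE)
print(filled)
```

As in the accumulate2() call, the first CLV_for value is never used, because the first row only seeds the accumulator.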