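Dplyr reshape activity date data to monthly level
I have a df containing user ID, subscription start and current month, activity date, and activity number. Users can appear multiple times if they have multiple activities. Here is a short toy example: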
USER_ID SUB_START CURRENT_MONTH ACTIVITY_DATE ACTIVITY_NUMBER
0102 2020-04-01 2020-08-01 2020-02-05 1
0102 2020-04-01 2020-08-01 2020-03-10 2
0102 2020-04-01 2020-08-01 2020-07-01 3
2190 2019-05-10 2020-08-01 2017-01-02 1
2190 2019-05-10 2020-08-01 2017-10-02 2
0121 2020-07-13 2020-08-01 2018-01-04 1
0121 2020-07-13 2020-08-01 2019-02-10 2
0121 2020-07-13 2020-08-01 2020-01-02 3
0121 2020-07-13 2020-08-01 2020-04-10 4
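What I want to accomplish is to group by month, then show the number of unique IDs with an active subscription in that month and the number of unique IDs with an activity date within the 13 months before that month. So the output for this toy dataset would look like: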
MONTH ACTIVE_COUNT ACTIVITY_COUNT
2019-05-01 1 0 *user 2190 active with no activity within past 13 mo
2019-06-01 1 0 *user 2190 active with no activity within past 13 mo
2019-07-01 1 0 *user 2190 active with no activity within past 13 mo
2019-08-01 1 0 *user 2190 active with no activity within past 13 mo
2019-09-01 1 0 *user 2190 active with no activity within past 13 mo
2019-10-01 1 0 *user 2190 active with no activity within past 13 mo
2019-11-01 1 0 *user 2190 active with no activity within past 13 mo
2019-12-01 1 0 *user 2190 active with no activity within past 13 mo
2020-01-01 1 0 *user 2190 active with no activity within past 13 mo
2020-02-01 1 0 *user 2190 active with no activity within past 13 mo
2020-03-01 1 0 *user 2190 active with no activity within past 13 mo
2020-04-01 2 1 *user 2190 and 0102 active and 0102 has a qualifying activity
2020-05-01 2 1 *user 2190 and 0102 active and 0102 has a qualifying activity
2020-06-01 2 1 *user 2190 and 0102 active and 0102 has a qualifying activity
2020-07-01 3 2 *user 2190,0102,0121 all active and 0102 and 0121 have qualifying activities
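So far I have pieced together the following code from a previous project, which gives me one row for every user and every month between their SUB_START and CURRENT_MONTH. The problem is that it repeats the process for every ACTIVITY_DATE, so each USER_ID ends up with multiple sets of active months. I would like to get one row for each month each user is active, and then add a column indicating whether that user has an ACTIVITY_DATE within 13 months of that month.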
df_monthly <- df %>%
  select(USER_ID, SUB_START, CURRENT_MONTH, ACTIVITY_DATE) %>%
  mutate(across(where(is.character), ~ floor_date(as.Date(.x) - 1, "months") + 1)) %>%
  rowwise() %>%
  mutate(MONTH = list(seq(SUB_START, CURRENT_MONTH, by = "+1 month"))) %>%
  unnest(MONTH) %>%
  mutate(MONTH2 = floor_date(MONTH, unit = "month"))
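You need to loop over the "current months", then for each of them compute each user's last activity, and finally count the users who are subscribed and the users with recent activity.
This should do roughly what you want: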
library(tidyverse)
library(lubridate)
# recreate your dataframe
df <- "USER_ID, SUB_START, CURRENT_MONTH , ACTIVITY_DATE , ACTIVITY_NUMBER
0102, 2020-04-01, 2020-08-01, 2020-02-05 ,1
0102, 2020-04-01, 2020-08-01 , 2020-03-10 ,2
0102, 2020-04-01, 2020-08-01, 2020-07-01 ,3
2190, 2019-05-10, 2020-08-01, 2017-01-02 ,1
2190, 2019-05-10, 2020-08-01, 2017-10-02 ,2
0121, 2020-07-13, 2020-08-01, 2018-01-04 ,1
0121, 2020-07-13, 2020-08-01, 2019-02-10 ,2
0121, 2020-07-13, 2020-08-01, 2020-01-02 ,3
0121, 2020-07-13, 2020-08-01, 2020-04-10 ,4" %>%
  str_remove_all(" ") %>%
  read_csv()

# for each month in the range, count subscribed and recently active users
seq(min(df$SUB_START), max(df$CURRENT_MONTH), by = "+1 month") %>%
  map_dfr(~ df %>%
    # latest activity date per user
    group_by(USER_ID, SUB_START) %>%
    summarize(LAST_ACTIVITY = max(ACTIVITY_DATE), .groups = "drop") %>%
    # time elapsed between that activity and the month being processed (.x)
    mutate(TIME_SINCE_LAST = .x - LAST_ACTIVITY) %>%
    summarize(n_users_subscribed = sum(SUB_START <= .x),
              n_recently_active = sum(TIME_SINCE_LAST < dmonths(13) &
                                        TIME_SINCE_LAST >= 0)) %>%
    add_column(month = .x)
  )
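One difference from your example data is that I don't count user 0121 for 2020-07-01, because they joined on the 13th; you may need to do some rounding (perhaps apply your floor_date before processing?).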
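A minimal sketch of that rounding, assuming only dplyr and lubridate. SUB_MONTH is a hypothetical helper column (not in the original data) holding SUB_START floored to the first of its month, so a user who joins mid-month counts as subscribed for that whole month:

```r
library(dplyr)
library(lubridate)

# One row per user, taken from the toy data in the question
subs <- tibble(
  USER_ID   = c("0102", "2190", "0121"),
  SUB_START = as.Date(c("2020-04-01", "2019-05-10", "2020-07-13"))
)

# SUB_MONTH is an illustrative name: SUB_START rounded down to the 1st of the month
subs_floored <- subs %>%
  mutate(SUB_MONTH = floor_date(SUB_START, unit = "month"))

month <- as.Date("2020-07-01")

# Without flooring, user 0121 (joined 2020-07-13) is not yet subscribed on 2020-07-01
n_unfloored <- sum(subs$SUB_START <= month)

# With flooring, user 0121 is counted for July
n_floored <- sum(subs_floored$SUB_MONTH <= month)
```

Comparing against SUB_MONTH instead of SUB_START in the subscription count should bring the July figure in line with the expected output in the question.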
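Note: your nested approach should also work. I couldn't try it (possibly because your columns are character when the data frame is read in), but you probably just need to pre-process the nested dataframe before unnesting so that it keeps only each user's last activity date.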
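As a rough sketch of the nested route, here is a variant that collapses the duplicated user-months after unnesting, rather than pre-processing before it. It checks every activity date instead of only the latest one, so a later activity cannot hide an earlier one that falls inside the window. The names SUB_MONTH, HAS_ACTIVITY, user_months and monthly are illustrative, and the window is assumed to mean "on or before the 1st of the month and less than 13 calendar months before it"; adjust the comparison if your definition differs.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy data from the question, with real Date columns
df <- tibble(
  USER_ID = c(rep("0102", 3), rep("2190", 2), rep("0121", 4)),
  SUB_START = as.Date(c(rep("2020-04-01", 3), rep("2019-05-10", 2),
                        rep("2020-07-13", 4))),
  CURRENT_MONTH = as.Date("2020-08-01"),
  ACTIVITY_DATE = as.Date(c("2020-02-05", "2020-03-10", "2020-07-01",
                            "2017-01-02", "2017-10-02",
                            "2018-01-04", "2019-02-10", "2020-01-02",
                            "2020-04-10"))
)

user_months <- df %>%
  # assumption: a user counts as subscribed from the 1st of their start month
  mutate(SUB_MONTH = floor_date(SUB_START, unit = "month")) %>%
  rowwise() %>%
  mutate(MONTH = list(seq(SUB_MONTH, CURRENT_MONTH, by = "month"))) %>%
  ungroup() %>%
  unnest(MONTH) %>%
  # collapse the repeated months: one row per user per month
  group_by(USER_ID, MONTH) %>%
  summarize(
    HAS_ACTIVITY = any(ACTIVITY_DATE <= MONTH &
                         ACTIVITY_DATE > MONTH %m-% months(13)),
    .groups = "drop"
  )

# one row per month: subscribed users and users with a qualifying activity
monthly <- user_months %>%
  group_by(MONTH) %>%
  summarize(ACTIVE_COUNT = n(),
            ACTIVITY_COUNT = sum(HAS_ACTIVITY))
```

Because the sequence runs through CURRENT_MONTH, this also produces a row for 2020-08-01, which the expected output in the question does not list.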