Dplyr reshape activity date data to monthly level

I have a df that contains user ID, subscription start and current month, activity date, and activity number. Users can appear more than once if they have multiple activities. Below is a short toy example:

USER_ID      SUB_START     CURRENT_MONTH     ACTIVITY_DATE    ACTIVITY_NUMBER
  0102       2020-04-01     2020-08-01        2020-02-05            1
  0102       2020-04-01     2020-08-01        2020-03-10            2
  0102       2020-04-01     2020-08-01        2020-07-01            3
  2190       2019-05-10     2020-08-01        2017-01-02            1
  2190       2019-05-10     2020-08-01        2017-10-02            2
  0121       2020-07-13     2020-08-01        2018-01-04            1
  0121       2020-07-13     2020-08-01        2019-02-10            2
  0121       2020-07-13     2020-08-01        2020-01-02            3
  0121       2020-07-13     2020-08-01        2020-04-10            4

What I would like to accomplish is to group by month and then show the number of unique IDs that had an active subscription that month and the number of unique IDs that had an activity date within the previous 13 months of that month. So the output for this toy data set would look like:

  MONTH       ACTIVE_COUNT    ACTIVITY_COUNT
2019-05-01         1               0          *user 2190 active with no activity within past 13 mo
2019-06-01         1               0          *user 2190 active with no activity within past 13 mo
2019-07-01         1               0          *user 2190 active with no activity within past 13 mo
2019-08-01         1               0          *user 2190 active with no activity within past 13 mo
2019-09-01         1               0          *user 2190 active with no activity within past 13 mo
2019-10-01         1               0          *user 2190 active with no activity within past 13 mo
2019-11-01         1               0          *user 2190 active with no activity within past 13 mo
2019-12-01         1               0          *user 2190 active with no activity within past 13 mo
2020-01-01         1               0          *user 2190 active with no activity within past 13 mo
2020-02-01         1               0          *user 2190 active with no activity within past 13 mo
2020-03-01         1               0          *user 2190 active with no activity within past 13 mo
2020-04-01         2               1          *user 2190 and 0102 active and 0102 has a qualifying activity
2020-05-01         2               1          *user 2190 and 0102 active and 0102 has a qualifying activity
2020-06-01         2               1          *user 2190 and 0102 active and 0102 has a qualifying activity
2020-07-01         3               2          *user 2190,0102,0121 all active and 0102 and 0121 have qualifying activities

So far I have put together the following code, based on a previous project, that gives me each user and a row for every month between their SUB_START and CURRENT_MONTH. The problem is that it repeats that process for every ACTIVITY_DATE, so each USER_ID has multiple groups of their active months. I am looking to have one row for every month each user is active and then add a column indicating whether that user had an ACTIVITY_DATE within 13 months of that month.

df_monthly <- df %>%
                  select(USER_ID,SUB_START, CURRENT_MONTH, ACTIVITY_DATE) %>%
                  mutate(across(where(is.character), ~ floor_date(as.Date(.x) - 1, "months") + 1)) %>%
                  rowwise() %>%
                  mutate(MONTH = list(seq(SUB_START,CURRENT_MONTH, by = "+1 month"))) %>%
                  unnest(MONTH) %>%
                  mutate(MONTH2 = floor_date(MONTH, unit="month"))

You need to loop over each "current month", then for each of these compute the last activity of each user, and finally count the number of users with a subscription and with recent activity.

This should do approximately what you want:

library(tidyverse)
library(lubridate)
# recreate your dataframe
df <- "USER_ID,      SUB_START,     CURRENT_MONTH ,   ACTIVITY_DATE ,   ACTIVITY_NUMBER
0102,       2020-04-01,     2020-08-01,        2020-02-05            ,1
0102,       2020-04-01,     2020-08-01 ,       2020-03-10            ,2
0102,       2020-04-01,     2020-08-01,      2020-07-01            ,3
2190,       2019-05-10,     2020-08-01,        2017-01-02            ,1
2190,       2019-05-10,     2020-08-01,        2017-10-02            ,2
0121,       2020-07-13,     2020-08-01,        2018-01-04            ,1
0121,       2020-07-13,     2020-08-01,        2019-02-10            ,2
0121,       2020-07-13,     2020-08-01,        2020-01-02            ,3
0121,       2020-07-13,     2020-08-01,        2020-04-10            ,4" %>%
  str_remove_all(" ") %>%
  read_csv()

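# one iteration per month, from the earliest SUB_START to the latest CURRENT_MONTH
seq(min(df$SUB_START), max(df$CURRENT_MONTH), by = "+1 month") %>%
  map_dfr(~ df %>%
            # last activity date of each user
            group_by(USER_ID, SUB_START) %>%
            summarize(LAST_ACTIVITY = max(ACTIVITY_DATE), .groups="drop") %>%
            mutate(TIME_SINCE_LAST = .x - LAST_ACTIVITY) %>%
            # users subscribed by this date, and users whose last activity is under 13 months old
            summarize(n_users_subscribed = sum(SUB_START <= .x),
                      n_recently_active = sum(TIME_SINCE_LAST < dmonths(13) &
                                                TIME_SINCE_LAST >= 0)) %>%
            add_column(month = .x)
  )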

One difference with your example data is that I don't count user 0121 on 2020-07-01, as they joined on the 13th. You might need to work on the rounding (perhaps apply your floor_date before processing?).

Note: your approach with nesting should work too. I couldn't try it (probably because you have characters when reading the dataframe), but you probably just need to preprocess the nested dataframe before unnesting to keep only the last activity date per user.
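As a rough illustration of the two suggestions above (flooring SUB_START to the month, and building one row per user and month before counting), here is a minimal sketch on the toy data. It is not the code from the question or the answer: the intermediate names (user_months, activity, flags, monthly, SUB_MONTH, DATES, HAS_ACTIVITY) are made up for the example, and the 13-month window is assumed to run from just after MONTH minus 13 months up to and including MONTH. It also keeps every activity date per user rather than only the last one, on the assumption that a later activity should not hide an earlier one that qualifies for an earlier month.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Toy data from the question; IDs kept as text, dates parsed explicitly.
df <- tibble(
  USER_ID = c("0102", "0102", "0102", "2190", "2190",
              "0121", "0121", "0121", "0121"),
  SUB_START = as.Date(c(rep("2020-04-01", 3), rep("2019-05-10", 2),
                        rep("2020-07-13", 4))),
  CURRENT_MONTH = as.Date("2020-08-01"),
  ACTIVITY_DATE = as.Date(c("2020-02-05", "2020-03-10", "2020-07-01",
                            "2017-01-02", "2017-10-02",
                            "2018-01-04", "2019-02-10", "2020-01-02",
                            "2020-04-10"))
)

# One row per user and subscribed month. Starting from the distinct
# subscriptions avoids repeating the months once per activity.
user_months <- df %>%
  distinct(USER_ID, SUB_START, CURRENT_MONTH) %>%
  mutate(SUB_MONTH = floor_date(SUB_START, "month")) %>%  # 2020-07-13 -> 2020-07-01
  rowwise() %>%
  mutate(MONTH = list(seq(SUB_MONTH, CURRENT_MONTH, by = "month"))) %>%
  ungroup() %>%
  unnest(MONTH) %>%
  select(USER_ID, MONTH)

# All activity dates of each user, stored in a list column (one row per user).
activity <- df %>%
  group_by(USER_ID) %>%
  summarize(DATES = list(unique(ACTIVITY_DATE)), .groups = "drop")

# Flag each user-month that has at least one activity in the window.
# ASSUMPTION: window is (MONTH - 13 months, MONTH].
flags <- user_months %>%
  left_join(activity, by = "USER_ID") %>%
  rowwise() %>%
  mutate(HAS_ACTIVITY = any(DATES > MONTH %m-% months(13) & DATES <= MONTH)) %>%
  ungroup()

# Monthly counts of subscribed users and of subscribed users with activity.
monthly <- flags %>%
  group_by(MONTH) %>%
  summarize(ACTIVE_COUNT = n_distinct(USER_ID),
            ACTIVITY_COUNT = sum(HAS_ACTIVITY),
            .groups = "drop")

print(monthly, n = Inf)
```

Under these assumptions the counts only include users who are subscribed in the given month, and the result also contains a row for 2020-08-01 because CURRENT_MONTH is included in the sequence. If the window boundaries or the treatment of the current month should differ, the comparison inside any() and the end of the seq() call are the places to adjust.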
