Calculating a 3-month rolling count of values in a Pandas column

Question

I have the following dataframe (this is a simplified version of the dataframe, but the logic is the same):

#MONTH = yyyy-mm-dd

    MONTH        User
0   2021-04-01   A
1   2021-04-01   B
2   2021-05-01   B
3   2021-06-01   A
4   2021-06-01   B
5   2021-07-01   A
6   2021-07-01   B
7   2021-08-01   A
8   2021-08-01   B

What I want is to determine whether a user is active on a 3-month rolling basis.

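For example, if we consider June (2021-06-01) for user B, we can see that he was also active in May and April, so on a 3M rolling basis he is considered active in June. User A, over the same period, was not active in one of the three months (May), so he would not be considered active in June.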

The desired output is a column that counts the number of active users per month (3M rolling), for example based on the data above:

    MONTH        Active_User_Count
0   2021-04-01   NaN
1   2021-05-01   NaN
2   2021-06-01   1
3   2021-07-01   1
4   2021-08-01   2

I'm still trying to get my head around rolling data, so if anyone can help me, that would be great! Thanks in advance!

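Edit: the MONTH column only contains values for the first day of each month, but there are multiple users on that day. So there is no 2021-04-30; every value is the first day of a month.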

Answer 1

OK, let's try this. Assume a pandas.DataFrame named df that has a MONTH column of type pandas.Timestamp and a User column we can groupby:

import pandas as pd
import numpy as np

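# Sample data from the question; replace with however you load your own data
df = pd.DataFrame({
    'MONTH': ['2021-04-01', '2021-04-01', '2021-05-01', '2021-06-01', '2021-06-01',
              '2021-07-01', '2021-07-01', '2021-08-01', '2021-08-01'],
    'User': list('ABBABABAB'),
})
df['MONTH'] = df['MONTH'].apply(pd.Timestamp)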

So, for example:

>>> df
       MONTH User
0 2021-04-01    A
1 2021-04-01    B
2 2021-05-01    B
3 2021-06-01    A
4 2021-06-01    B
5 2021-07-01    A
6 2021-07-01    B
7 2021-08-01    A
8 2021-08-01    B

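Then, given the above, let's make a DataFrame to hold our results, with consecutive months from the start to the end of the input DataFrame, and initialize the active user count column to 0: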

res = pd.DataFrame(pd.date_range(df.MONTH.min(),df.MONTH.max(),freq='MS'),columns=['MONTH'])
res['Active_User_Count'] = 0
res = res.set_index('MONTH').sort_index()
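
As a quick check on this step, the sketch below shows what `pd.date_range` with `freq='MS'` (month start) yields for the question's date span. The hard-coded start and end dates and the name `res_demo` are assumptions made for illustration, taken from the sample data rather than from `df`.

```python
import pandas as pd

# Start and end are hard-coded from the question's sample data (assumption)
months = pd.date_range('2021-04-01', '2021-08-01', freq='MS')

# 'MS' is the month-start frequency, so a month with no rows in the
# input would still get an entry in the result frame
res_demo = pd.DataFrame({'Active_User_Count': 0}, index=months)
res_demo.index.name = 'MONTH'
print(res_demo)
```

Building the result from a continuous range, rather than from the months that happen to appear in the data, keeps a gap month from silently shortening the 3-month window.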

Now add the values:

for user, frame in df.groupby(by='User'):
    # Make a helper column that indicates whether the user was active
    # that month (value='both') or not (value='right_only'), by
    # outer-merging against every month between the user's first and
    # last active month
    frame = (
        frame.merge(
            pd.Series(pd.date_range(start=frame.MONTH.min(),
                                    end=frame.MONTH.max(),
                                    freq='MS'),
                      name='MONTH'),
            on='MONTH', how='outer', indicator=True)
        .set_index('MONTH').sort_index()
    )
    # This is where the magic happens:
    # encode the '_merge' results as category codes
    # (0 = left_only, 1 = right_only, 2 = both), then take the minimum
    # over a 3-wide rolling window. A minimum greater than 1.5 means all
    # three months in the window are 'both'; otherwise the user wasn't
    # active for all 3 months. The first two months have an incomplete
    # window (NaN), which compares as False.

    # Finally, funnel the comparison into numpy.where, which sets
    # 'Active_User_Count' of the in-process user frame
    # to an array of 1s and 0s
    frame['Active_User_Count'] = np.where(
        frame['_merge'].astype('category').cat.codes
                       .rolling(3).min() > 1.5, 1, 0)

    # Add the current user's activity into the total result. Use .loc
    # rather than chained indexing, which is not reliable for assignment.
    res.loc[frame.index, 'Active_User_Count'] += frame['Active_User_Count']

# some re-formatting
res = res.reset_index().sort_index()

After all that, we get our output:

>>> res
       MONTH  Active_User_Count
0 2021-04-01                  0
1 2021-05-01                  0
2 2021-06-01                  1
3 2021-07-01                  1
4 2021-08-01                  2
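
To make the rolling-minimum trick easier to follow, here is a minimal sketch that walks through it for user A alone. The names `frame_a`, `all_months`, and `merged` are made up for this illustration, and the dates are hard-coded from the question's sample data.

```python
import pandas as pd

# User A's active months from the question; May is missing
active = pd.to_datetime(['2021-04-01', '2021-06-01', '2021-07-01', '2021-08-01'])
frame_a = pd.DataFrame({'MONTH': active, 'User': 'A'})

# Every month between A's first and last activity
all_months = pd.Series(pd.date_range(active.min(), active.max(), freq='MS'),
                       name='MONTH')

merged = (frame_a.merge(all_months, on='MONTH', how='outer', indicator=True)
                 .set_index('MONTH').sort_index())

# Category codes: 1 = right_only (inactive month), 2 = both (active month)
codes = merged['_merge'].cat.codes
rolling_min = codes.rolling(3).min()
is_active = rolling_min > 1.5

print(pd.DataFrame({'code': codes, 'rolling_min': rolling_min,
                    'active_3m': is_active}))
```

May gets code 1, which drags the rolling minimum down for June and July, so August is the only month in which user A has a full window of active months.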

TL;DR

Here is a function that does this:

import pandas as pd
import numpy as np

def active_users(df):
    """Count, per month, the users active in all 3 months of the rolling window."""
    res = pd.DataFrame(pd.date_range(df.MONTH.min(),
                                     df.MONTH.max(),
                                     freq='MS'),
                       columns=['MONTH'])
    res['Active_User_Count'] = 0
    res = res.set_index('MONTH').sort_index()

    for user, frame in df.groupby(by='User'):
        # Outer-merge against the user's full month range; '_merge' is
        # 'both' for active months and 'right_only' for gaps
        frame = (
            frame.merge(
                pd.Series(pd.date_range(start=frame.MONTH.min(),
                                        end=frame.MONTH.max(),
                                        freq='MS'),
                          name='MONTH'),
                on='MONTH', how='outer', indicator=True)
            .set_index('MONTH').sort_index()
        )
        # 1 where all 3 months in the rolling window are 'both' (code 2)
        frame['Active_User_Count'] = np.where(
            frame['_merge'].astype('category').cat.codes
                           .rolling(3).min() > 1.5, 1, 0)
        # .loc instead of chained indexing, which is not reliable for assignment
        res.loc[frame.index, 'Active_User_Count'] += frame['Active_User_Count']

    return res.reset_index().sort_index()
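
For comparison, here is a sketch of a loop-free variant. It is not the approach used in the answer above but a swapped-in technique: it uses `pd.crosstab` to build a month-by-user activity table and then applies the same 3-month rolling minimum to every user column at once. The sample data is rebuilt from the question, and the names `activity`, `full_range`, and `counts` are assumptions made for this illustration.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'MONTH': pd.to_datetime(['2021-04-01', '2021-04-01', '2021-05-01',
                             '2021-06-01', '2021-06-01', '2021-07-01',
                             '2021-07-01', '2021-08-01', '2021-08-01']),
    'User': list('ABBABABAB'),
})

# One row per month, one column per user; 1 means active that month
activity = pd.crosstab(df['MONTH'], df['User']).gt(0).astype(int)

# Add rows for months in which nobody was active, so that a window of
# 3 rows always spans 3 calendar months
full_range = pd.date_range(activity.index.min(), activity.index.max(), freq='MS')
activity = activity.reindex(full_range, fill_value=0)

# A user counts as active when the 3-month rolling minimum is 1;
# incomplete windows are NaN and therefore count as inactive
active_3m = activity.rolling(3).min().eq(1)
counts = active_3m.sum(axis=1).rename('Active_User_Count')
counts.index.name = 'MONTH'
print(counts.reset_index())
```

On the sample data this should give the same counts as the function above. Like the function, it reports 0 rather than NaN for the first two months.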

Solution 1
0 2021-10-21 17:38:38
