Create a Pandas TimeSeries from data, a period range, and an aggregation function

Question

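Context

I want to create a time series (using pandas) that counts the distinct values of an id when the date under consideration falls between its start date and its end date.

For readability, this is a simplified version of the problem.

Data

Let's define the data like this: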

import pandas as pd

df = pd.DataFrame({
    'customerId': [
        '1', '1', '1', '2', '2'
    ],
    'id': [
        '1', '2', '3', '1', '2'
    ],
    'startDate': [
        '2000-01', '2000-01', '2000-04', '2000-05', '2000-06',
    ],
    'endDate': [
        '2000-08', '2000-02', '2000-07', '2000-07', '2000-08',
    ],
})

The period range is:

period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')

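Goal

For each customerId there are several distinct ids. The end goal is to get, for each customerId and for each date of the period range, the count of distinct ids whose startDate and endDate satisfy the function my_date_predicate.

A simplified definition of my_date_predicate: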

unset_date = pd.to_datetime("1900-01")


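def my_date_predicate(date, row):
    # True if the row has started by `date` and has not ended yet (or has no end date set).
    # Timestamps are compared with ==; they have no .equals() method.
    return row.startDate <= date and \
           (row.endDate == unset_date or row.endDate > date)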
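
As a reference point, the predicate can be applied directly with a plain loop over the period range and the customers. This is only a minimal sketch, assuming the dates have already been parsed to timestamps and that each period is represented by the first day of its month; the names `records` and `reference` are made up for the illustration. It is slow on large data, but useful for checking a vectorized solution.

```python
import pandas as pd

# Same sample data as in the question, with the dates parsed to timestamps.
df = pd.DataFrame({
    "customerId": ["1", "1", "1", "2", "2"],
    "id": ["1", "2", "3", "1", "2"],
    "startDate": pd.to_datetime(["2000-01", "2000-01", "2000-04", "2000-05", "2000-06"]),
    "endDate": pd.to_datetime(["2000-08", "2000-02", "2000-07", "2000-07", "2000-08"]),
})
period_range = pd.period_range(start="2000-01", end="2000-07", freq="M")
unset_date = pd.to_datetime("1900-01")


def my_date_predicate(date, row):
    # Started on or before `date`, and not ended yet (or no end date set).
    return row.startDate <= date and (row.endDate == unset_date or row.endDate > date)


records = []
for period in period_range:
    date = period.to_timestamp()  # assumption: first day of the month stands for the period
    for customer_id, group in df.groupby("customerId"):
        # Keep only the rows of this customer that satisfy the predicate.
        active = group[group.apply(lambda row: my_date_predicate(date, row), axis=1)]
        records.append({
            "date": str(period),
            "customerId": customer_id,
            "customerCount": active["id"].nunique(),
        })

reference = pd.DataFrame(records)
print(reference)
```

The cost grows with the number of periods times the number of rows, so this is meant as a correctness check rather than a solution for large frames.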

Expected result

I would like a time series result like this:

        date customerId customerCount
0   2000-01          1             2
1   2000-01          2             0
2   2000-02          1             1
3   2000-02          2             0
4   2000-03          1             1
5   2000-03          2             0
6   2000-04          1             2
7   2000-04          2             0
8   2000-05          1             2
9   2000-05          2             1
10  2000-06          1             2
11  2000-06          2             2
12  2000-07          1             1
13  2000-07          2             1

Question

How can I get this result using pandas?

Answer 1

Here is a solution:

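# Parse the month strings into timestamps (first day of each month)
df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)
# One entry per month in [startDate, endDate); `inclusive` needs pandas >= 1.4 (older versions: closed="left")
df["month"] = df.apply(lambda row: pd.date_range(row["startDate"], row["endDate"], freq="MS", inclusive="left"), axis=1)
df = df.explode("month")
# explode may leave an object column; convert it so the merge keys have matching dtypes
df["month"] = pd.to_datetime(df["month"])

period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')

# Build every (month, customerId) pair so that months without any id still show up with a count of 0
t = pd.DataFrame(period_range.to_timestamp(), columns=["month"])
customers_df = pd.DataFrame(df.customerId.unique(), columns = ["customerId"])
t = pd.merge(t.assign(dummy=1), customers_df.assign(dummy=1), on = "dummy").drop("dummy", axis=1)
t = pd.merge(t, df, on = ["customerId", "month"], how = "left")
# count() skips the NaN ids introduced by the left merge
t.groupby(["month", "customerId"]).count()[["id"]].rename(columns={"id": "count"})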

The result is:

                       count
month      customerId       
2000-01-01 1               2
           2               0
2000-02-01 1               1
           2               0
2000-03-01 1               1
           2               0
2000-04-01 1               2
           2               0
2000-05-01 1               2
           2               1
2000-06-01 1               2
           2               2
2000-07-01 1               1
           2               1
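
The result above is indexed by (month, customerId) with timestamps, while the question asks for flat date, customerId and customerCount columns with monthly periods. One possible way to reshape it is sketched below. The `counts` frame is a small hypothetical stand-in for the grouped result (two months only), and `flat` is a name chosen for the example.

```python
import pandas as pd

# Hypothetical stand-in for the grouped result above, limited to two months.
counts = pd.DataFrame(
    {"count": [2, 0, 1, 0]},
    index=pd.MultiIndex.from_product(
        [pd.to_datetime(["2000-01-01", "2000-02-01"]), ["1", "2"]],
        names=["month", "customerId"],
    ),
)

# Move the index levels back into columns and use the column names from the question.
flat = counts.reset_index().rename(columns={"month": "date", "count": "customerCount"})
# Show months as periods (2000-01) instead of timestamps (2000-01-01).
flat["date"] = flat["date"].dt.to_period("M")
print(flat)
```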

Note:

For unset dates, replace the end date with the last date you are interested in before starting the computation.
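
A minimal sketch of that replacement follows, assuming unset end dates are marked with the `unset_date` value from the question. Because the end date is treated as exclusive in this approach, the sketch substitutes the first month after the period range rather than the last month itself, so that open-ended ids are still counted in the final month. The sample frame and the name `open_end` are hypothetical.

```python
import pandas as pd

unset_date = pd.to_datetime("1900-01")
period_range = pd.period_range(start="2000-01", end="2000-07", freq="M")

# Hypothetical sample: the second row has no real end date yet.
df = pd.DataFrame({
    "customerId": ["1", "1"],
    "id": ["1", "2"],
    "startDate": pd.to_datetime(["2000-01", "2000-05"]),
    "endDate": [pd.Timestamp("2000-03-01"), unset_date],
})

# First month after the range, so an exclusive end still covers 2000-07.
open_end = (period_range[-1] + 1).to_timestamp()
# Replace the marker wherever it appears; other end dates are left untouched.
df["endDate"] = df["endDate"].mask(df["endDate"] == unset_date, open_end)
print(df)
```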

Answer 2

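You can use two pivot_table calls to get, for each customer (in the columns), the count of ids per start date (and per end date) in the index. Reindex each of them with the period_range you are interested in. Subtract the end-date pivot from the start-date pivot. Use cumsum to get the running number of active ids per customerId. Finally, use stack and reset_index to get the desired shape.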

# convert to monthly period columns, matching period_range
df['startDate'] = pd.to_datetime(df['startDate']).dt.to_period('M')
df['endDate'] = pd.to_datetime(df['endDate']).dt.to_period('M')

#create the pivots
pvs = (df.pivot_table(index='startDate', columns='customerId', values='id', 
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
pve = (df.pivot_table(index='endDate', columns='customerId', values='id', 
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
print (pvs)
customerId  1  2
2000-01     2  0 # two ids of customer 1 start in this month
2000-02     0  0
2000-03     0  0
2000-04     1  0
2000-05     0  1 # one id of customer 2 starts in this month
2000-06     0  1
2000-07     0  0

Now you can subtract one from the other and use cumsum to get the desired count at each date.

res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId','customerCount']
print (res)
       date customerId  customerCount
0   2000-01          1              2
1   2000-01          2              0
2   2000-02          1              1
3   2000-02          2              0
4   2000-03          1              1
5   2000-03          2              0
6   2000-04          1              2
7   2000-04          2              0
8   2000-05          1              2
9   2000-05          2              1
10  2000-06          1              2
11  2000-06          2              2
12  2000-07          1              1
13  2000-07          2              1
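
Why the subtraction followed by cumsum yields the number of active ids can be seen on a single interval. The sketch below uses one hypothetical id that starts in 2000-02 and ends in 2000-05 (end exclusive); `starts`, `ends` and `active` are names chosen for the illustration.

```python
import pandas as pd

months = pd.period_range(start="2000-01", end="2000-07", freq="M")

# One hypothetical id: +1 in the month it starts, +1 in the month it ends.
starts = pd.Series([0, 1, 0, 0, 0, 0, 0], index=months)
ends = pd.Series([0, 0, 0, 0, 1, 0, 0], index=months)

# The running sum of (starts - ends) is 1 while the id is active and 0 otherwise.
active = (starts - ends).cumsum()
print(active)
```

Each start adds one to the running total and each end removes one, so with several ids the running total is the number of ids active in each month.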

Note that I'm not really sure how to handle unset_date, as I don't see what it is used for.
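
One possible way to fit unset end dates into the pivot approach is sketched here, under the assumption that an unset end date means the id is still active and is marked with the 1900-01 period. Such rows are simply left out of the end-date pivot, so they are never subtracted. The sample data, the helper `monthly_counts` and the name `unset_period` are hypothetical.

```python
import pandas as pd

period_range = pd.period_range(start="2000-01", end="2000-07", freq="M")
unset_period = pd.Period("1900-01", freq="M")  # assumed marker for "no end date"

# Hypothetical sample: id 2 of customer 1 is open-ended.
df = pd.DataFrame({
    "customerId": ["1", "1", "2"],
    "id": ["1", "2", "1"],
    "startDate": pd.PeriodIndex(["2000-01", "2000-03", "2000-05"], freq="M"),
    "endDate": pd.PeriodIndex(["2000-04", "1900-01", "2000-07"], freq="M"),
})


def monthly_counts(frame, date_col, columns):
    # Count ids per month (rows) and customer (columns), aligned to the period range.
    return (frame.pivot_table(index=date_col, columns="customerId",
                              values="id", aggfunc="count", fill_value=0)
                 .reindex(index=period_range, columns=columns, fill_value=0))


customers = sorted(df["customerId"].unique())
pvs = monthly_counts(df, "startDate", customers)
# Open-ended rows never end inside the range, so they are excluded from the end counts.
pve = monthly_counts(df[df["endDate"] != unset_period], "endDate", customers)

res = (pvs - pve).cumsum()
print(res)
```

Reindexing both pivots on the same customer list keeps the subtraction aligned even when a customer has only open-ended ids.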

2 answers

Answer 1
2 2020-06-17 09:39:25

Answer 2
1 2020-06-17 18:39:57