Create a Pandas TimeSeries from data, a period range, and an aggregation function

Question

Context

I'd like to create a time series (with pandas) that counts the distinct values of an id whose start and end dates bracket the date under consideration.

For the sake of legibility, this is a simplified version of the problem.

Data

Let's define the data this way:

import pandas as pd

df = pd.DataFrame({
    'customerId': [
        '1', '1', '1', '2', '2'
    ],
    'id': [
        '1', '2', '3', '1', '2'
    ],
    'startDate': [
        '2000-01', '2000-01', '2000-04', '2000-05', '2000-06',
    ],
    'endDate': [
        '2000-08', '2000-02', '2000-07', '2000-07', '2000-08',
    ],
})

And the period range this way:

period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')

Objectives

For each customerId, there are several distinct id values. The final aim is to get, for each date of the period range and for each customerId, the count of distinct id whose startDate and endDate satisfy the function my_date_predicate.

Simplified definition of my_date_predicate:

unset_date = pd.to_datetime("1900-01")


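def my_date_predicate(date, row):
    # An id counts on `date` if it has started and has not ended yet;
    # an endDate equal to unset_date means "no end date".
    # Timestamps are compared with ==, since they have no .equals method.
    return row.startDate <= date and \
           (row.endDate == unset_date or row.endDate > date)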
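A brute-force sketch of what the predicate is meant to compute may help as a reference when checking faster solutions. It assumes the date columns are first parsed with pd.to_datetime, and it compares endDate to unset_date with == because a pandas Timestamp has no equals method. The names records and brute_force are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'customerId': ['1', '1', '1', '2', '2'],
    'id': ['1', '2', '3', '1', '2'],
    'startDate': ['2000-01', '2000-01', '2000-04', '2000-05', '2000-06'],
    'endDate': ['2000-08', '2000-02', '2000-07', '2000-07', '2000-08'],
})
# The predicate compares timestamps, so parse the string columns first.
df['startDate'] = pd.to_datetime(df['startDate'])
df['endDate'] = pd.to_datetime(df['endDate'])

period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
unset_date = pd.to_datetime("1900-01")


def my_date_predicate(date, row):
    # Active if started on or before `date` and not ended (or never ending).
    return row.startDate <= date and (row.endDate == unset_date or row.endDate > date)


records = []
for period in period_range:
    date = period.to_timestamp()  # first day of the month
    for customer_id, group in df.groupby('customerId'):
        # Collect the ids of this customer that satisfy the predicate.
        active = {row.id for row in group.itertuples() if my_date_predicate(date, row)}
        records.append({'date': period, 'customerId': customer_id,
                        'customerCount': len(active)})

brute_force = pd.DataFrame(records)
print(brute_force)
```

This loops over every (date, customerId) pair, so it is slow on large data, but it yields one row per pair and follows the predicate literally.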

Expected result

I'd like a time series result like this:

        date customerId customerCount
0   2000-01          1             2
1   2000-01          2             0
2   2000-02          1             1
3   2000-02          2             0
4   2000-03          1             1
5   2000-03          2             0
6   2000-04          1             2
7   2000-04          2             0
8   2000-05          1             2
9   2000-05          2             1
10  2000-06          1             2
11  2000-06          2             2
12  2000-07          1             1
13  2000-07          2             1

Question

How could I use pandas to get such a result?

Answer 1

Here's a solution:

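df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)
# One row per month in [startDate, endDate); pandas < 1.4 spells this closed="left".
df["month"] = df.apply(lambda row: pd.date_range(row["startDate"], row["endDate"], freq="MS", inclusive="left"), axis=1)
df = df.explode("month")
# explode can leave "month" as object dtype; convert it so the merge keys match.
df["month"] = pd.to_datetime(df["month"])

period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')

# Build every (month, customerId) pair, then left-join the exploded rows onto it.
t = pd.DataFrame(period_range.to_timestamp(), columns=["month"])
customers_df = pd.DataFrame(df.customerId.unique(), columns = ["customerId"])
t = pd.merge(t.assign(dummy=1), customers_df.assign(dummy=1), on = "dummy").drop("dummy", axis=1)
t = pd.merge(t, df, on = ["customerId", "month"], how = "left")
# count() skips the NaN ids of unmatched pairs, so those pairs come out as 0.
t.groupby(["month", "customerId"]).count()[["id"]].rename(columns={"id": "count"})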

The result is:

                       count
month      customerId       
2000-01-01 1               2
           2               0
2000-02-01 1               1
           2               0
2000-03-01 1               1
           2               0
2000-04-01 1               2
           2               0
2000-05-01 1               2
           2               1
2000-06-01 1               2
           2               2
2000-07-01 1               1
           2               1
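The dummy-key merge in this answer exists only to pair every month with every customer. As a side sketch, and assuming pandas 1.2 or later, the same grid can be built with how="cross". The names months, customers and grid are illustrative, and the customer list is hard-coded here instead of being read from df.

```python
import pandas as pd

period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
months = pd.DataFrame({'month': period_range.to_timestamp()})
customers = pd.DataFrame({'customerId': ['1', '2']})

# how="cross" pairs every row of `months` with every row of `customers`,
# so no temporary dummy column is needed.
grid = months.merge(customers, how='cross')
print(grid.head())
```

The grid has one row per (month, customerId) pair, which is what guarantees that pairs with no active id still appear in the final count with a value of 0.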

Note:

For unset dates, replace the end date with the very last date you're interested in before you start the calculation.
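One way this replacement might look is sketched below. It assumes the unset marker is the question's unset_date (1900-01) and uses a small hypothetical sample frame. Because the date range in this answer is left-closed, the sketch substitutes the first month after the period range, so that the last month of interest is still covered; open_end is an illustrative name.

```python
import pandas as pd

unset_date = pd.to_datetime("1900-01")
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')

# Hypothetical sample: the second row has no real end date.
sample = pd.DataFrame({
    'startDate': pd.to_datetime(['2000-01', '2000-05']),
    'endDate': pd.to_datetime(['2000-03', '1900-01']),
})

# First month after the range; a left-closed range up to it includes 2000-07.
open_end = (period_range[-1] + 1).to_timestamp()
sample['endDate'] = sample['endDate'].mask(sample['endDate'] == unset_date, open_end)
print(sample)
```

Rows with a real end date are left untouched; only the marker value is swapped before the month ranges are generated.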

Answer 2

You can do it with two pivot_table calls, which give the count of id per customer (in columns) per start date, and per end date, in the index. reindex each one with the period_range you are interested in. Subtract the pivot for the end dates from the pivot for the start dates. Use cumsum to get the cumulative sum of id per customerId. Finally, use stack and reset_index to bring the result to the wanted shape.

#convert to period columns like period_date
df['startDate'] = pd.to_datetime(df['startDate']).dt.to_period('M')
df['endDate'] = pd.to_datetime(df['endDate']).dt.to_period('M')

#create the pivots
pvs = (df.pivot_table(index='startDate', columns='customerId', values='id', 
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
pve = (df.pivot_table(index='endDate', columns='customerId', values='id', 
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
print(pvs)
customerId  1  2
2000-01     2  0 # two ids of customer 1 start in this month
2000-02     0  0
2000-03     0  0
2000-04     1  0
2000-05     0  1 # one id of customer 2 starts in this month
2000-06     0  1
2000-07     0  0

Now you can subtract one from the other and use cumsum to get the wanted count per date.

res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId','customerCount']
print (res)
       date customerId  customerCount
0   2000-01          1              2
1   2000-01          2              0
2   2000-02          1              1
3   2000-02          2              0
4   2000-03          1              1
5   2000-03          2              0
6   2000-04          1              2
7   2000-04          2              0
8   2000-05          1              2
9   2000-05          2              1
10  2000-06          1              2
11  2000-06          2              2
12  2000-07          1              1
13  2000-07          2              1

Not really sure how to handle the unset_date, as I don't see what it is used for.
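If unset_date is read as "this id never ends", as the predicate in the question suggests, one possible adaptation of the pivot approach is to leave those rows out of the end-date pivot so they never decrement the running count. This is only a sketch under that assumption, on a hypothetical sample where customer 2 has an open-ended id; has_end and customers are illustrative names.

```python
import pandas as pd

unset_date = pd.to_datetime("1900-01")
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')

# Hypothetical sample: customer '2' has one id whose end date is unset.
df = pd.DataFrame({
    'customerId': ['1', '1', '2'],
    'id': ['1', '2', '1'],
    'startDate': pd.to_datetime(['2000-01', '2000-01', '2000-05']),
    'endDate': pd.to_datetime(['2000-08', '2000-02', '1900-01']),
})

has_end = df['endDate'] != unset_date  # rows with a real end date
df['startDate'] = df['startDate'].dt.to_period('M')
df['endDate'] = df['endDate'].dt.to_period('M')
customers = sorted(df['customerId'].unique())

pvs = (df.pivot_table(index='startDate', columns='customerId', values='id',
                      aggfunc='count', fill_value=0)
         .reindex(index=period_range, columns=customers, fill_value=0))
# Only rows with a real end date may decrement the running count.
pve = (df[has_end].pivot_table(index='endDate', columns='customerId', values='id',
                               aggfunc='count', fill_value=0)
         .reindex(index=period_range, columns=customers, fill_value=0))

res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId', 'customerCount']
print(res)
```

Reindexing the columns as well as the index keeps both pivots aligned even when a customer has no ended id at all, which would otherwise drop that customer's column from the end-date pivot.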

2 answers

Answer 1
2 2020-06-17 09:39:25

Answer 2
1 2020-06-17 18:39:57
