Create Pandas TimeSeries from Data, Period-range and aggregation function
I'd like to create a time series (with pandas) that counts the distinct values of an id whose start and end dates bracket the date being considered.
For the sake of legibility, this is a simplified version of the problem.
Let's define the data this way:
import pandas as pd
df = pd.DataFrame({
'customerId': [
'1', '1', '1', '2', '2'
],
'id': [
'1', '2', '3', '1', '2'
],
'startDate': [
'2000-01', '2000-01', '2000-04', '2000-05', '2000-06',
],
'endDate': [
'2000-08', '2000-02', '2000-07', '2000-07', '2000-08',
],
})
And the period range this way:
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
For each customerId, there are several distinct id values. The final aim is to get, for each date of the period range and for each customerId, the count of distinct id whose startDate and endDate satisfy the function my_date_predicate.
Simplified definition of my_date_predicate:
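# sentinel meaning "no end date set"
unset_date = pd.to_datetime("1900-01")
def my_date_predicate(date, row):
    # active at `date`: started on or before it, and either has no end date
    # or ends strictly after it (a Timestamp has no .equals, so compare with ==)
    return row.startDate <= date and \
        (row.endDate == unset_date or row.endDate > date)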
I'd like a time series result like this:
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 0
How could I use pandas to get such a result?
Here's a solution:
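df.startDate = pd.to_datetime(df.startDate)
df.endDate = pd.to_datetime(df.endDate)
# one row per (id, active month); the end month itself is excluded
# (pandas >= 1.4 uses inclusive="left"; older versions used closed="left")
df["month"] = df.apply(lambda row: pd.date_range(row["startDate"], row["endDate"], freq="MS", inclusive="left"), axis=1)
df = df.explode("month")
# explode leaves an object column; convert it so both merge keys are datetime64
df["month"] = pd.to_datetime(df["month"])
period_range = pd.period_range(start='2000-01', end='2000-07', freq='M')
t = pd.DataFrame(period_range.to_timestamp(), columns=["month"])
customers_df = pd.DataFrame(df.customerId.unique(), columns = ["customerId"])
# cross join: pair every month with every customer
t = pd.merge(t.assign(dummy=1), customers_df.assign(dummy=1), on = "dummy").drop("dummy", axis=1)
# left join keeps (month, customer) pairs with no active id, so they count as 0
t = pd.merge(t, df, on = ["customerId", "month"], how = "left")
t.groupby(["month", "customerId"]).count()[["id"]].rename(columns={"id": "count"})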
The result is:
                       count
month      customerId
2000-01-01 1               2
           2               0
2000-02-01 1               1
           2               0
2000-03-01 1               1
           2               0
2000-04-01 1               2
           2               0
2000-05-01 1               2
           2               1
2000-06-01 1               2
           2               2
2000-07-01 1               1
           2               1
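As a cross-check on the table above, the predicate from the question can be applied directly to every (month, customer) pair. This is only a brute-force sketch, not the method used in this answer. It assumes the sample data and period range from the question, compares the end date with `==` where the question wrote `.equals` (an assumption about the intended comparison), and takes the first day of each month as the date. The name `brute_force` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": ["1", "1", "1", "2", "2"],
    "id": ["1", "2", "3", "1", "2"],
    "startDate": ["2000-01", "2000-01", "2000-04", "2000-05", "2000-06"],
    "endDate": ["2000-08", "2000-02", "2000-07", "2000-07", "2000-08"],
})
df["startDate"] = pd.to_datetime(df["startDate"])
df["endDate"] = pd.to_datetime(df["endDate"])

unset_date = pd.to_datetime("1900-01")  # sentinel for "no end date"


def my_date_predicate(date, row):
    # Assumption: `==` stands in for the question's `.equals` call.
    return row.startDate <= date and (row.endDate == unset_date or row.endDate > date)


period_range = pd.period_range(start="2000-01", end="2000-07", freq="M")

records = []
for period in period_range:
    date = period.to_timestamp()  # first day of the month
    for customer_id, group in df.groupby("customerId"):
        # distinct ids of this customer that are active at `date`
        active_ids = {row.id for row in group.itertuples() if my_date_predicate(date, row)}
        records.append({"date": period, "customerId": customer_id,
                        "customerCount": len(active_ids)})

brute_force = pd.DataFrame(records)
print(brute_force)
```

With the sample data this gives the same counts as the table above, including 1 for customer 2 in 2000-07, because id 2 of that customer only ends in 2000-08. The nested loop is slow on large data, so it is better suited to validating a vectorized solution than to replacing one.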
Note:
You can do it with two pivot_table calls that count the id per customer (one column per customer), indexed by start date in one and by end date in the other. reindex each one with the period_range you are interested in. Subtract the end pivot from the start pivot. Use cumsum to get the cumulative sum of id per customerId. Finally, use stack and reset_index to bring the result into the wanted shape.
# convert the date columns to monthly periods, like period_range
df['startDate'] = pd.to_datetime(df['startDate']).dt.to_period('M')
df['endDate'] = pd.to_datetime(df['endDate']).dt.to_period('M')
# create the pivots: ids starting (pvs) / ending (pve) per month and customer
pvs = (df.pivot_table(index='startDate', columns='customerId', values='id',
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
pve = (df.pivot_table(index='endDate', columns='customerId', values='id',
                      aggfunc='count', fill_value=0)
         .reindex(period_range, fill_value=0)
      )
print (pvs)
customerId  1  2
2000-01     2  0  # two ids for customer 1 start in this month
2000-02     0  0
2000-03     0  0
2000-04     1  0
2000-05     0  1  # one id for customer 2 starts in this month
2000-06     0  1
2000-07     0  0
Now you can subtract one from the other and use cumsum to get the wanted count per date.
res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ['date', 'customerId','customerCount']
print (res)
date customerId customerCount
0 2000-01 1 2
1 2000-01 2 0
2 2000-02 1 1
3 2000-02 2 0
4 2000-03 1 1
5 2000-03 2 0
6 2000-04 1 2
7 2000-04 2 0
8 2000-05 1 2
9 2000-05 2 1
10 2000-06 1 2
11 2000-06 2 2
12 2000-07 1 1
13 2000-07 2 1
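To see why subtracting and accumulating works, here is the same arithmetic for customer 1 alone, as a small sketch. The start counts are copied from the pvs column printed above; the end counts are read off the sample data (id 2 ends in 2000-02, id 3 in 2000-07, and id 1 ends in 2000-08, which is outside the range). The names `starts`, `ends` and `active` are illustrative only.

```python
import pandas as pd

months = pd.period_range(start="2000-01", end="2000-07", freq="M")

# ids of customer 1 that start in each month (column "1" of pvs)
starts = pd.Series([2, 0, 0, 1, 0, 0, 0], index=months)
# ids of customer 1 that end in each month (column "1" of pve)
ends = pd.Series([0, 1, 0, 0, 0, 0, 1], index=months)

# Each start adds one active id and each end removes one, so the running
# total of (starts - ends) is the number of ids active in that month.
active = (starts - ends).cumsum()
print(active)
```

Because an end is subtracted in the month it occurs, an id is no longer counted in its end month, which matches the `endDate > date` condition of the predicate.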
Not really sure how to handle the unset_date, as I don't see what it is used for.
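One possible reading of unset_date, going by the predicate in the question, is that it marks an id with no end date, so such an id stays active once it has started. Under that assumption, a sketch of the pivot approach simply leaves those rows out of the end pivot. The sample data below is changed hypothetically so that id 3 of customer 1 carries the sentinel; `unset_period`, `customers` and `ended` are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "customerId": ["1", "1", "1", "2", "2"],
    "id": ["1", "2", "3", "1", "2"],
    "startDate": ["2000-01", "2000-01", "2000-04", "2000-05", "2000-06"],
    # Hypothetical change: id 3 of customer 1 has the sentinel as end date.
    "endDate": ["2000-08", "2000-02", "1900-01", "2000-07", "2000-08"],
})
period_range = pd.period_range(start="2000-01", end="2000-07", freq="M")
unset_period = pd.Period("1900-01", freq="M")  # assumed "no end date" marker

df["startDate"] = pd.to_datetime(df["startDate"]).dt.to_period("M")
df["endDate"] = pd.to_datetime(df["endDate"]).dt.to_period("M")

# Fix the column set up front so both pivots line up even if a customer
# has no row left after filtering.
customers = sorted(df["customerId"].unique())

pvs = (df.pivot_table(index="startDate", columns="customerId", values="id",
                      aggfunc="count", fill_value=0)
         .reindex(index=period_range, columns=customers, fill_value=0))

# Rows with the sentinel never end, so they must not be subtracted.
ended = df[df["endDate"] != unset_period]
pve = (ended.pivot_table(index="endDate", columns="customerId", values="id",
                         aggfunc="count", fill_value=0)
            .reindex(index=period_range, columns=customers, fill_value=0))

res = (pvs - pve).cumsum().stack().reset_index()
res.columns = ["date", "customerId", "customerCount"]
print(res)
```

With this modified data, customer 1 keeps two active ids through 2000-07, since id 3 is never closed. If the sentinel means something else in the real data, the filter on `ended` is the only line that needs to change.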