Python Pandas: Calculate average days between dates
Working with the following Python pandas dataframe df:
Customer_ID | Transaction_ID
ABC 2016-05-06-1234
ABC 2017-06-08-3456
ABC 2017-07-12-5678
ABC 2017-12-20-6789
BCD 2016-08-23-7891
BCD 2016-09-21-2345
BCD 2017-10-23-4567
The date is unfortunately hidden in the Transaction_ID string, so I edited the dataframe this way:
#year of transaction
df['year'] = df['Transaction_ID'].astype(str).str[:4]
#date of transaction
df['date'] = df['Transaction_ID'].astype(str).str[:10]
#format date
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d')
#calculate visit number per year
df['visit_nr_yr'] = df.groupby(['Customer_ID', 'year']).cumcount()+1
Now the df looks like this:
Customer_ID | Transaction_ID | year | date |visit_nr_yr
ABC 2016-05-06-1234 2016 2016-05-06 1
ABC 2017-06-08-3456 2017 2017-06-08 1
ABC 2017-07-12-5678 2017 2017-07-12 2
ABC 2017-12-20-6789 2017 2017-12-20 3
BCD 2016-08-23-7891 2016 2016-08-23 1
BCD 2016-09-21-2345 2016 2016-09-21 2
BCD 2017-10-23-4567 2017 2017-10-23 1
I need to calculate the following:
First, I would like to add the column "days_bw_visits_yr", the days between visits within a year (computed per Customer_ID):
Customer_ID|Transaction_ID |year| date |visit_nr_yr|days_bw_visits_yr
ABC 2016-05-06-1234 2016 2016-05-06 1 NaN
ABC 2017-06-08-3456 2017 2017-06-08 1 NaN
ABC 2017-07-12-5678 2017 2017-07-12 2 34
ABC 2017-12-20-6789 2017 2017-12-20 3 161
BCD 2016-08-23-7891 2016 2016-08-23 1 NaN
BCD 2016-09-21-2345 2016 2016-09-21 2 29
BCD 2017-10-23-4567 2017 2017-10-23 1 NaN
Please note that I avoided 0s on purpose and kept the NaNs, in case somebody had two visits on the same day.
Next, I want to calculate the average days between visits by visit number (so between visits 1 and 2, and between visits 2 and 3, within a year). I am looking for this output:
avg_days_bw_visits_1_2 | avg_days_bw_visits_2_3
31.5 161
Finally, I want to calculate the average days between visits in general:
output: 203.8
#the days between visits are 398, 34, 161, 29, 397 and the average of those numbers is 203.8
I'm stuck on how to create the column "days_bw_visits_yr". NaNs have to be excluded from the math.
You can get the previous visit date (grouped by customer and year) by shifting the "date" column down by 1:
df['previous_visit'] = df.groupby(['Customer_ID', 'year'])['date'].shift()
From this, the days between visits is simply the difference:
df['days_bw_visits'] = df['date'] - df['previous_visit']
To calculate the mean, convert the timedelta values to whole days:
df['days_bw_visits'] = df['days_bw_visits'].apply(lambda x: x.days)
Average days between visits, first per visit number and then overall (NaNs are skipped by mean):
df.groupby('visit_nr_yr')['days_bw_visits'].agg('mean')
df['days_bw_visits'].mean()
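Note that the overall mean on the last line only covers gaps within the same year (34, 161, 29). The 203.8 figure in the question also counts the gaps that cross a year boundary (398 and 397). A minimal sketch of one way to reproduce it, assuming the sample data from the question and grouping by Customer_ID alone; the column name days_bw_visits_all is illustrative and not from the original post:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Customer_ID": ["ABC", "ABC", "ABC", "ABC", "BCD", "BCD", "BCD"],
    "Transaction_ID": [
        "2016-05-06-1234", "2017-06-08-3456", "2017-07-12-5678",
        "2017-12-20-6789", "2016-08-23-7891", "2016-09-21-2345",
        "2017-10-23-4567",
    ],
})
df["date"] = pd.to_datetime(df["Transaction_ID"].str[:10], format="%Y-%m-%d")

# Sort so that each row's predecessor is the customer's previous visit.
df = df.sort_values(["Customer_ID", "date"])

# Gap to the previous visit of the same customer, ignoring the year boundary.
# "days_bw_visits_all" is a hypothetical column name used only in this sketch.
df["days_bw_visits_all"] = df.groupby("Customer_ID")["date"].diff().dt.days

# mean() skips the NaN rows (each customer's first visit) by default.
overall_avg = df["days_bw_visits_all"].mean()
print(round(overall_avg, 1))
```

The first visit of each customer has no predecessor, so it stays NaN and does not enter the average.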
Source DF:
In [96]: df
Out[96]:
Customer_ID Transaction_ID
0 ABC 2016-05-06-1234
1 ABC 2017-06-08-3456
2 ABC 2017-07-12-5678
3 ABC 2017-12-20-6789
4 BCD 2016-08-23-7891
5 BCD 2016-09-21-2345
6 BCD 2017-10-23-4567
Solution:
df['Date'] = pd.to_datetime(df.Transaction_ID.str[:10])
df['visit_nr_yr'] = df.groupby(['Customer_ID', df['Date'].dt.year]).cumcount()+1
df['days_bw_visits_yr'] = \
df.groupby(['Customer_ID', df['Date'].dt.year])['Date'].diff().dt.days
Result:
In [98]: df
Out[98]:
Customer_ID Transaction_ID Date visit_nr_yr days_bw_visits_yr
0 ABC 2016-05-06-1234 2016-05-06 1 NaN
1 ABC 2017-06-08-3456 2017-06-08 1 NaN
2 ABC 2017-07-12-5678 2017-07-12 2 34.0
3 ABC 2017-12-20-6789 2017-12-20 3 161.0
4 BCD 2016-08-23-7891 2016-08-23 1 NaN
5 BCD 2016-09-21-2345 2016-09-21 2 29.0
6 BCD 2017-10-23-4567 2017-10-23 1 NaN
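To get from this column to the per-visit averages asked for in the question, one option is to group by visit_nr_yr and take the mean. The sketch below rebuilds the sample data and, as an assumption made for readability, stores the year in an explicit "year" column instead of passing df['Date'].dt.year to groupby directly:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Customer_ID": ["ABC", "ABC", "ABC", "ABC", "BCD", "BCD", "BCD"],
    "Transaction_ID": [
        "2016-05-06-1234", "2017-06-08-3456", "2017-07-12-5678",
        "2017-12-20-6789", "2016-08-23-7891", "2016-09-21-2345",
        "2017-10-23-4567",
    ],
})
df["Date"] = pd.to_datetime(df["Transaction_ID"].str[:10])
df["year"] = df["Date"].dt.year  # explicit helper column (assumption of this sketch)

# Visit number and gap to the previous visit, both within customer and year.
df["visit_nr_yr"] = df.groupby(["Customer_ID", "year"]).cumcount() + 1
df["days_bw_visits_yr"] = df.groupby(["Customer_ID", "year"])["Date"].diff().dt.days

# Mean gap per visit number; visit 1 has only NaNs, so it is dropped.
avg_by_visit = df.groupby("visit_nr_yr")["days_bw_visits_yr"].mean().dropna()
print(avg_by_visit)
```

The entry for visit number 2 is the average gap between visits 1 and 2, and the entry for visit number 3 is the average gap between visits 2 and 3.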
It is worth noting that, in addition to getting the time difference from the previous purchase with
df['previous_visit'] = df.groupby(['Customer_ID', 'year'])['date'].shift()
df['days_bw_visits'] = df['date'] - df['previous_visit']
df['days_bw_visits'] = df['days_bw_visits'].apply(lambda x: x.days)
you should make sure your rows are sorted by customer and date before executing .shift(), to avoid negative days_bw_visits:
df = df.sort_values(['Customer_ID', 'date'])
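A small sketch of why the sort matters. The three rows and their shuffled order are made up for illustration (the dates are taken from customer ABC in the question), and days_since_previous is a hypothetical helper name:

```python
import pandas as pd

# Rows deliberately out of chronological order (assumption for this example).
df = pd.DataFrame({
    "Customer_ID": ["ABC", "ABC", "ABC"],
    "date": pd.to_datetime(["2017-12-20", "2017-06-08", "2017-07-12"]),
})
df["year"] = df["date"].dt.year


def days_since_previous(frame):
    """Days between each row and the preceding row of the same customer and year."""
    previous = frame.groupby(["Customer_ID", "year"])["date"].shift()
    return (frame["date"] - previous).dt.days


# Without sorting, the "previous" row can be a later date, giving a negative gap.
unsorted_days = days_since_previous(df)

# After sorting by customer and date, every gap is non-negative.
sorted_df = df.sort_values(["Customer_ID", "date"])
sorted_days = days_since_previous(sorted_df)

print(unsorted_days.tolist())
print(sorted_days.tolist())
```

shift() works on row position within each group, not on date order, so the sort has to happen before the groupby call.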