Python Pandas: calculating the average number of days between dates

Question

I'm working with the following pandas DataFrame, df:

Customer_ID | Transaction_ID
ABC            2016-05-06-1234
ABC            2017-06-08-3456
ABC            2017-07-12-5678
ABC            2017-12-20-6789
BCD            2016-08-23-7891
BCD            2016-09-21-2345
BCD            2017-10-23-4567

Unfortunately, the date is hidden in the Transaction_ID string, so I edited the DataFrame this way:

#year of transaction
df['year'] = df['Transaction_ID'].astype(str).str[:4]

#date of transaction
df['date'] = df['Transaction_ID'].astype(str).str[:10]

#format date
df['date']=pd.to_datetime(df['date'], format='%Y-%m-%d')

#calculate visit number per year
df['visit_nr_yr'] = df.groupby(['Customer_ID', 'year']).cumcount()+1

Now the df looks like this:

Customer_ID | Transaction_ID    | year  | date        |visit_nr_yr 
ABC            2016-05-06-1234    2016    2016-05-06    1            
ABC            2017-06-08-3456    2017    2017-06-08    1            
ABC            2017-07-12-5678    2017    2017-07-12    2            
ABC            2017-12-20-6789    2017    2017-12-20    3            
BCD            2016-08-23-7891    2016    2016-08-23    1            
BCD            2016-09-21-2345    2016    2016-09-21    2            
BCD            2017-10-23-4567    2017    2017-10-23    1
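For anyone who wants to reproduce the setup, here is a minimal self-contained sketch that rebuilds the sample data and the derived columns. It assumes the first 10 characters of Transaction_ID are always a YYYY-MM-DD date, and it stores year as an integer taken from the parsed date rather than as a string slice.

```python
import pandas as pd

# Sample data copied from the table above.
df = pd.DataFrame({
    "Customer_ID": ["ABC", "ABC", "ABC", "ABC", "BCD", "BCD", "BCD"],
    "Transaction_ID": [
        "2016-05-06-1234", "2017-06-08-3456", "2017-07-12-5678",
        "2017-12-20-6789", "2016-08-23-7891", "2016-09-21-2345",
        "2017-10-23-4567",
    ],
})

# The first 10 characters of the ID hold the date (assumed format YYYY-MM-DD).
df["date"] = pd.to_datetime(df["Transaction_ID"].str[:10], format="%Y-%m-%d")
df["year"] = df["date"].dt.year

# Number each customer's visits within a calendar year, starting at 1.
df["visit_nr_yr"] = df.groupby(["Customer_ID", "year"]).cumcount() + 1
print(df)
```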

I need to calculate the following:

What is the average number of days between visits, by visit number (that is, between visits 1 and 2, and between visits 2 and 3)?
What is the average number of days between visits in general?

First, I would like to add the following column, "days_bw_visits_yr" (calculated per Customer_ID):

Customer_ID|Transaction_ID  |year| date       |visit_nr_yr|days_bw_visits_yr 
ABC         2016-05-06-1234  2016  2016-05-06   1             NaN
ABC         2017-06-08-3456  2017  2017-06-08   1             NaN
ABC         2017-07-12-5678  2017  2017-07-12   2             34
ABC         2017-12-20-6789  2017  2017-12-20   3             161
BCD         2016-08-23-7891  2016  2016-08-23   1             NaN
BCD         2016-09-21-2345  2016  2016-09-21   2             29
BCD         2017-10-23-4567  2017  2017-10-23   1             NaN

Please note that I avoided 0s on purpose and kept the NaNs, in case somebody had two visits on the same day.

Next, I want to calculate the average number of days between visits by visit number (so between visits 1 and 2, and between visits 2 and 3, within a year). I'm looking for this output:

avg_days_bw_visits_1_2 | avg_days_bw_visits_2_3
31.5                     161

Finally, I want to calculate the average number of days between visits in general:

output: 203.8
# The days between visits are 398, 34, 161, 29, and 397; the average of those numbers is 203.8.

I'm stuck on how to create the column "days_bw_visits_yr". NaNs have to be excluded from the math.

Answer 1

You can get the previous visit date (grouped by customer and year) by shifting the "date" column down by 1:

df['previous_visit'] = df.groupby(['Customer_ID', 'year'])['date'].shift()

From this, the number of days between visits is simply the difference:

df['days_bw_visits'] = df['date'] - df['previous_visit']

To calculate the mean, convert the timedelta values to whole days (missing values stay NaN):

df['days_bw_visits'] = df['days_bw_visits'].dt.days

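Average days between visits, first by visit number and then overall (mean() skips NaNs by default):

df.groupby('visit_nr_yr')['days_bw_visits'].agg('mean')

df['days_bw_visits'].mean()

Note that because previous_visit was computed per customer and year, the overall mean here covers only gaps within the same year. To include gaps that cross a year boundary, as in the question's 203.8 figure, group the shift by 'Customer_ID' alone.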
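As a minimal sketch of how the grouping affects the overall average, assuming the sample data from the question: grouping the shift by customer and year leaves only the within-year gaps, while grouping by customer alone also picks up the gaps that span two years. The names gaps_within_year and gaps_any_year are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample dates from the question, already in chronological order per customer.
df = pd.DataFrame({
    "Customer_ID": ["ABC", "ABC", "ABC", "ABC", "BCD", "BCD", "BCD"],
    "date": pd.to_datetime([
        "2016-05-06", "2017-06-08", "2017-07-12", "2017-12-20",
        "2016-08-23", "2016-09-21", "2017-10-23",
    ]),
})
df["year"] = df["date"].dt.year

# Previous visit by the same customer in the same year.
prev_same_year = df.groupby(["Customer_ID", "year"])["date"].shift()
gaps_within_year = (df["date"] - prev_same_year).dt.days

# Previous visit by the same customer, regardless of year.
prev_any_year = df.groupby("Customer_ID")["date"].shift()
gaps_any_year = (df["date"] - prev_any_year).dt.days

# Series.mean() skips NaN by default, so first visits are excluded.
within_year_avg = gaps_within_year.mean()
overall_avg = gaps_any_year.mean()
print(round(within_year_avg, 1), round(overall_avg, 1))
```

On this data the within-year gaps are 34, 161, and 29, while the customer-level gaps are 398, 34, 161, 29, and 397, which average to the 203.8 given in the question.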

Answer 2

Source DF:

In [96]: df
Out[96]:
  Customer_ID   Transaction_ID
0         ABC  2016-05-06-1234
1         ABC  2017-06-08-3456
2         ABC  2017-07-12-5678
3         ABC  2017-12-20-6789
4         BCD  2016-08-23-7891
5         BCD  2016-09-21-2345
6         BCD  2017-10-23-4567

Solution:

df['Date'] = pd.to_datetime(df.Transaction_ID.str[:10])
df['visit_nr_yr'] = df.groupby(['Customer_ID', df['Date'].dt.year]).cumcount()+1
df['days_bw_visits_yr'] = \
    df.groupby(['Customer_ID', df['Date'].dt.year])['Date'].diff().dt.days

Result:

In [98]: df
Out[98]:
  Customer_ID   Transaction_ID       Date  visit_nr_yr  days_bw_visits_yr
0         ABC  2016-05-06-1234 2016-05-06            1                NaN
1         ABC  2017-06-08-3456 2017-06-08            1                NaN
2         ABC  2017-07-12-5678 2017-07-12            2               34.0
3         ABC  2017-12-20-6789 2017-12-20            3              161.0
4         BCD  2016-08-23-7891 2016-08-23            1                NaN
5         BCD  2016-09-21-2345 2016-09-21            2               29.0
6         BCD  2017-10-23-4567 2017-10-23            1                NaN
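The result above stops at the per-row column. One possible way to get from there to the one-row table of averages the question asks for is sketched below; the reshaping step and the generated column names are an assumption about the desired layout, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "Customer_ID": ["ABC", "ABC", "ABC", "ABC", "BCD", "BCD", "BCD"],
    "Transaction_ID": [
        "2016-05-06-1234", "2017-06-08-3456", "2017-07-12-5678",
        "2017-12-20-6789", "2016-08-23-7891", "2016-09-21-2345",
        "2017-10-23-4567",
    ],
})
df["Date"] = pd.to_datetime(df["Transaction_ID"].str[:10])

# Group by customer and calendar year, as in the solution above.
keys = ["Customer_ID", df["Date"].dt.year.rename("year")]
df["visit_nr_yr"] = df.groupby(keys).cumcount() + 1
df["days_bw_visits_yr"] = df.groupby(keys)["Date"].diff().dt.days

# Drop first visits (NaN gaps), then average the gaps per visit number.
avg = (
    df.dropna(subset=["days_bw_visits_yr"])
      .groupby("visit_nr_yr")["days_bw_visits_yr"]
      .mean()
)

# Label each average by the pair of visits it spans, e.g. 1_2 for visit 2.
avg.index = [f"avg_days_bw_visits_{n - 1}_{n}" for n in avg.index]
wide = avg.to_frame().T
print(wide)
```

Dropping the NaN rows first keeps visit number 1 out of the result, since a first visit in a year has no within-year gap to average.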

Answer 3

It's worth noting that, in addition to getting the time difference since the previous purchase:

df['previous_visit'] = df.groupby(['Customer_ID', 'year'])['date'].shift()
df['days_bw_visits'] = df['date'] - df['previous_visit'] 
df['days_bw_visits'] = df['days_bw_visits'].apply(lambda x: x.days)

you should make sure your rows are sorted by customer and date before calling .shift(), to avoid negative days_bw_visits:

df = df.sort_values(['Customer_ID', 'date'])
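To illustrate why the sort matters, here is a small sketch with hypothetical out-of-order rows for a single customer. The helper days_since_previous is made up for this example and simply wraps the shift-and-subtract steps from above.

```python
import pandas as pd

# Hypothetical rows for one customer, deliberately out of date order.
df = pd.DataFrame({
    "Customer_ID": ["ABC", "ABC", "ABC"],
    "date": pd.to_datetime(["2017-12-20", "2017-06-08", "2017-07-12"]),
})
df["year"] = df["date"].dt.year


def days_since_previous(frame):
    """Days between each row and the preceding row of the same customer and year."""
    prev = frame.groupby(["Customer_ID", "year"])["date"].shift()
    return (frame["date"] - prev).dt.days


# Without sorting, "previous" means the previous row, not the previous date.
unsorted_gaps = days_since_previous(df)

# After sorting, the previous row is also the previous visit.
sorted_df = df.sort_values(["Customer_ID", "date"])
sorted_gaps = days_since_previous(sorted_df)

print(unsorted_gaps.tolist())
print(sorted_gaps.tolist())
```

In the unsorted frame the second row is compared with the December visit, which gives a gap of -195 days; after sorting, the gaps are 34 and 161.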

Answer 1
10, accepted, 2017-07-21 16:51:30

Answer 2
1, 2017-07-21 17:21:35

Answer 3
1, 2020-02-27 00:44:52