I have the dataframe below:
import pandas as pd
data = pd.DataFrame({
'ID': ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25',
'2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
'Payment_Term': [7,8,3,6,4,7,8,5,3,6],
'Payment_Date': ['2020-07-05', '2020-07-05','2020-07-03', '2020-07-21', '2020-07-31',
'2020-08-15', '2020-08-22', '2020-06-16', '2020-06-23', '2020-07-05'],
})
df = pd.DataFrame(data, columns = ['ID', 'Invoice_Date', 'Payment_Term', 'Payment_Date'])
df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'].astype(str), format='%Y-%m-%d')
df['Payment_Date'] = pd.to_datetime(df['Payment_Date'].astype(str), format='%Y-%m-%d')
df['Due_Date'] = df['Invoice_Date'] + pd.to_timedelta(df['Payment_Term'], unit = 'd')
df['Delay'] = df['Payment_Date'] - df['Due_Date']
df['Delay'] = df['Delay'].dt.days
print(df)
Out [1]:
      ID  Invoice_Date  Payment_Term  Payment_Date    Due_Date  Delay
0  27459    2020-06-26             7    2020-07-05  2020-07-03      2
1  27459    2020-06-29             8    2020-07-05  2020-07-07     -2
2  27459    2020-06-30             3    2020-07-03  2020-07-03      0
3  27459    2020-07-14             6    2020-07-21  2020-07-20      1
4  27459    2020-07-25             4    2020-07-31  2020-07-29      2
5  27459    2020-07-30             7    2020-08-15  2020-08-06      9
6  27459    2020-08-02             8    2020-08-22  2020-08-10     12
7  48002    2020-05-13             5    2020-06-16  2020-05-18     29
8  48002    2020-06-20             3    2020-06-23  2020-06-23      0
9  48002    2020-06-28             6    2020-07-05  2020-07-04      1
Now I want to create a new column named Average_Delay based on this assumption: the invoices of ID 27459 are grouped into two 30-day periods, 2020-06-26 to 2020-07-25 and 2020-07-30 to 2020-08-02. ID 48002 also has two 30-day periods: 2020-05-13 on its own, and 2020-06-20 to 2020-06-28. Average_Delay is recorded on the final invoice date of each 30-day period for the ID, and it is calculated as the sum of Delay divided by the number of invoices in that period. The expected output should look more or less like this:
Out [2]:
      ID  Invoice_Date  Payment_Term  Payment_Date    Due_Date  Delay  Average_Delay
0  27459    2020-06-26             7    2020-07-05  2020-07-03      2
1  27459    2020-06-29             8    2020-07-05  2020-07-07     -2
2  27459    2020-06-30             3    2020-07-03  2020-07-03      0
3  27459    2020-07-14             6    2020-07-21  2020-07-20      1
4  27459    2020-07-25             4    2020-07-31  2020-07-29      2            0.6
5  27459    2020-07-30             7    2020-08-15  2020-08-06      9
6  27459    2020-08-02             8    2020-08-22  2020-08-10     12           10.5
7  48002    2020-05-13             5    2020-06-16  2020-05-18     29             29
8  48002    2020-06-20             3    2020-06-23  2020-06-23      0
9  48002    2020-06-28             6    2020-07-05  2020-07-04      1            0.5
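The Average_Delay can be calculated using .groupby and .resample, shown here for a single ID (Delay is selected before .mean() so that the non-numeric columns are not averaged):
df.groupby("ID").get_group("27459").resample("30D", on="Invoice_Date")["Delay"].mean()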
results in
Invoice_Date
2020-06-26 0.6
2020-07-26 10.5
But I don't know how to place the results at the correct position. Maybe someone else has an idea.
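One possible way to put each average on the last invoice of its period, sketched under some assumptions: the 30-day windows are counted from each ID's first invoice date (which is how resample("30D") bins each ID by default), the rows are sorted by Invoice_Date within each ID, and Period is a hypothetical helper column that is not part of the original data.

```python
import pandas as pd

# Reduced copy of the sample data: only the columns needed for the average.
df = pd.DataFrame({
    "ID": ["27459"] * 7 + ["48002"] * 3,
    "Invoice_Date": pd.to_datetime([
        "2020-06-26", "2020-06-29", "2020-06-30", "2020-07-14", "2020-07-25",
        "2020-07-30", "2020-08-02", "2020-05-13", "2020-06-20", "2020-06-28",
    ]),
    "Delay": [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
})

# Make sure the last row of each period is also the latest invoice in it.
df = df.sort_values(["ID", "Invoice_Date"]).reset_index(drop=True)

# Hypothetical helper column: number of the 30-day window, counted from
# the first invoice of each ID (0 = first 30 days, 1 = next 30 days, ...).
first_invoice = df.groupby("ID")["Invoice_Date"].transform("min")
df["Period"] = (df["Invoice_Date"] - first_invoice).dt.days // 30

# Mean delay of the period, broadcast to every row of that period.
period_mean = df.groupby(["ID", "Period"])["Delay"].transform("mean")

# Keep the value only on the last row of each (ID, Period) pair.
is_last_in_period = ~df.duplicated(["ID", "Period"], keep="last")
df["Average_Delay"] = period_mean.where(is_last_in_period)

print(df)
```

Because the value is attached by row position inside each period, this does not depend on the resample labels coinciding with an actual invoice date. The Period column can be dropped afterwards with df.drop(columns="Period").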
Based on Andre S.'s answer you could do
delays = df.groupby("ID").resample("30D", on="Invoice_Date")["Delay"].mean()
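and place them into df with the following:
import numpy as np
df['Average_Delay'] = np.nan
# delays is indexed by (ID, bin label), so loop over both index levels
for id_, invoice_date in delays.index:
    mask = (df['ID'] == id_) & (df['Invoice_Date'] == invoice_date)
    df.loc[mask, 'Average_Delay'] = delays[(id_, invoice_date)]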
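But I am afraid some dates may not match Invoice_Date, because the dates in the resample index are bin labels rather than actual invoice dates; rows whose label matches no invoice are simply left as NaN. You could also do this per calendar month with a "1M" resample frequency. Another approach is to use ID and Invoice_Date together as an index; however, I did not show it because it changes the structure of df.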
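To see how far the concern about mismatched dates goes on the sample data, here is a small check, a sketch rather than part of either answer. It resamples each ID separately with DataFrame.resample and compares every bin label with the invoice dates of that ID. The names labels, Bin_Start and Label_Is_Invoice_Date are made up for this illustration.

```python
import pandas as pd

# Reduced copy of the sample data: only the columns needed for the check.
df = pd.DataFrame({
    "ID": ["27459"] * 7 + ["48002"] * 3,
    "Invoice_Date": pd.to_datetime([
        "2020-06-26", "2020-06-29", "2020-06-30", "2020-07-14", "2020-07-25",
        "2020-07-30", "2020-08-02", "2020-05-13", "2020-06-20", "2020-06-28",
    ]),
    "Delay": [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
})

rows = []
for id_, group in df.groupby("ID"):
    # 30-day bins for this ID; each bin is labelled with its start date.
    means = group.resample("30D", on="Invoice_Date")["Delay"].mean()
    for label, value in means.items():
        rows.append({
            "ID": id_,
            "Bin_Start": label,
            "Average_Delay": value,
            # True only if some invoice of this ID falls exactly on the label.
            "Label_Is_Invoice_Date": bool((group["Invoice_Date"] == label).any()),
        })

labels = pd.DataFrame(rows)
print(labels)
```

On this data the second bin of each ID is labelled with a date that is not an invoice date (2020-07-26 for 27459 and 2020-06-12 for 48002), and the first label is the first invoice of the period, not the last one. An equality match on Invoice_Date therefore cannot put the averages on the rows shown in the expected output.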