How to add a calculated column based on conditions in pandas?

I have a dataframe below:

import pandas as pd

data = pd.DataFrame({
        'ID':  ['27459', '27459', '27459', '27459', '27459', '27459', '27459', '48002', '48002', '48002'],
        'Invoice_Date': ['2020-06-26', '2020-06-29', '2020-06-30', '2020-07-14', '2020-07-25', 
                         '2020-07-30', '2020-08-02', '2020-05-13', '2020-06-20', '2020-06-28'],
        'Payment_Term': [7,8,3,6,4,7,8,5,3,6],
        'Payment_Date': ['2020-07-05', '2020-07-05','2020-07-03', '2020-07-21', '2020-07-31', 
                         '2020-08-15', '2020-08-22', '2020-06-16', '2020-06-23', '2020-07-05'],
        })

df = pd.DataFrame(data, columns=['ID', 'Invoice_Date', 'Payment_Term', 'Payment_Date'])

# Parse the date strings into datetime64 columns
df['Invoice_Date'] = pd.to_datetime(df['Invoice_Date'], format='%Y-%m-%d')
df['Payment_Date'] = pd.to_datetime(df['Payment_Date'], format='%Y-%m-%d')
# Due date = invoice date + payment term (in days)
df['Due_Date'] = df['Invoice_Date'] + pd.to_timedelta(df['Payment_Term'], unit='d')
# Delay in whole days; a negative value means the payment was early
df['Delay'] = (df['Payment_Date'] - df['Due_Date']).dt.days

print(df)


Out [1]:

      ID Invoice_Date  Payment_Term Payment_Date   Due_Date  Delay
0  27459   2020-06-26             7   2020-07-05 2020-07-03      2
1  27459   2020-06-29             8   2020-07-05 2020-07-07     -2
2  27459   2020-06-30             3   2020-07-03 2020-07-03      0
3  27459   2020-07-14             6   2020-07-21 2020-07-20      1
4  27459   2020-07-25             4   2020-07-31 2020-07-29      2
5  27459   2020-07-30             7   2020-08-15 2020-08-06      9
6  27459   2020-08-02             8   2020-08-22 2020-08-10     12
7  48002   2020-05-13             5   2020-06-16 2020-05-18     29
8  48002   2020-06-20             3   2020-06-23 2020-06-23      0
9  48002   2020-06-28             6   2020-07-05 2020-07-04      1

Now I want to create a new column named Average_Delay based on these rules:

  1. Invoices are grouped into 30-day periods per ID. For ID 27459 this gives two periods: 2020-06-26 until 2020-07-25, and 2020-07-30 until 2020-08-02. ID 48002 also has two periods: 2020-05-13 on its own, and 2020-06-20 until 2020-06-28.
  2. The Average_Delay is recorded on the last invoice date of each ID's 30-day period.
  3. The Average_Delay is the sum of Delay divided by the number of invoices in the 30-day period.

The expected output should look more or less like this:

Out [2]:

      ID Invoice_Date  Payment_Term Payment_Date   Due_Date  Delay   Average_Delay
0  27459   2020-06-26             7   2020-07-05 2020-07-03      2               
1  27459   2020-06-29             8   2020-07-05 2020-07-07     -2
2  27459   2020-06-30             3   2020-07-03 2020-07-03      0
3  27459   2020-07-14             6   2020-07-21 2020-07-20      1
4  27459   2020-07-25             4   2020-07-31 2020-07-29      2            0.6
5  27459   2020-07-30             7   2020-08-15 2020-08-06      9
6  27459   2020-08-02             8   2020-08-22 2020-08-10     12           10.5
7  48002   2020-05-13             5   2020-06-16 2020-05-18     29             29
8  48002   2020-06-20             3   2020-06-23 2020-06-23      0
9  48002   2020-06-28             6   2020-07-05 2020-07-04      1            0.5

The Average_Delay can be calculated using .groupby and .resample like:

df.groupby("ID").get_group("27459").resample("30D", on="Invoice_Date")["Delay"].mean()

results in

Invoice_Date
2020-06-26     0.6
2020-07-26    10.5

But I don't know how to place the results at the correct position. Maybe someone else has an idea.

Based on Andre S.'s answer you could do

delays = df.groupby("ID").resample("30D", on="Invoice_Date")["Delay"].mean()

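and place them into df with the following:

import numpy as np

df['Average_Delay'] = np.nan
# delays has a (ID, Invoice_Date) MultiIndex; copy each mean to the matching row
for inv_id, invoice_date in delays.index:
    df.loc[(df['ID'] == inv_id) & (df['Invoice_Date'] == invoice_date), "Average_Delay"] = delays[(inv_id, invoice_date)]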

But I am afraid some of the resampled dates may not match any Invoice_Date, in which case that average is not placed. You could do this for each calendar month with a "1M" resample frequency instead ("ME" in pandas 2.2 and later). Another approach is to use ID and Invoice_Date together as the index; however, I did not show it because it changes the structure of df.
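
To avoid depending on the resampled dates matching an Invoice_Date, one possible sketch is to number the 30-day periods directly on the rows and then keep the period mean only on the last invoice of each period. This is not from either answer above. It assumes each ID's periods are counted from that ID's first invoice date, which is how the question's example is grouped. The names first_invoice, period, period_mean and is_last are made up for illustration, and the data is a trimmed copy of the question's frame.

```python
import pandas as pd

# Trimmed copy of the question's data: only the columns this sketch needs
df = pd.DataFrame({
    "ID": ["27459"] * 7 + ["48002"] * 3,
    "Invoice_Date": pd.to_datetime([
        "2020-06-26", "2020-06-29", "2020-06-30", "2020-07-14", "2020-07-25",
        "2020-07-30", "2020-08-02", "2020-05-13", "2020-06-20", "2020-06-28",
    ]),
    "Delay": [2, -2, 0, 1, 2, 9, 12, 29, 0, 1],
})

# Assumption: each ID's 30-day periods are counted from that ID's first invoice
first_invoice = df.groupby("ID")["Invoice_Date"].transform("min")
period = ((df["Invoice_Date"] - first_invoice).dt.days // 30).rename("period")

grouped = df.groupby(["ID", period])
# Mean delay of the period, broadcast to every row in it
period_mean = grouped["Delay"].transform("mean")
# True only on the row(s) carrying the latest invoice date of the period
is_last = df["Invoice_Date"] == grouped["Invoice_Date"].transform("max")

# Keep the mean on the last row of each period, NaN elsewhere
df["Average_Delay"] = period_mean.where(is_last)
print(df)
```

With this data the non-empty Average_Delay values land on rows 4, 6, 7 and 9 (0.6, 10.5, 29.0 and 0.5), as in the expected output. If two invoices of one ID share the last date of a period, both rows receive the value.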
