
Python Pandas sum with multiple conditions

Below is my sample data:

        Customer   Document Date   Clearing Date   Invoice_Amount
0       A          09/13/2016      11/04/2016      2,007,324
1       A          04/18/2016      07/11/2016      631,714
2       A          09/13/2016      09/16/2016      4,000,000
3       A          07/11/2017      09/23/2017      5,000,000
4       A          05/03/2016      06/17/2016      2,000,000
---     ---        ---             ---             ---
1158    H          04/21/2017      06/28/2017      3,000,000
1159    H          04/25/2017      05/19/2017      1,000,000
1160    H          11/03/2017      12/11/2017      4,500,000
1161    H          03/15/2018      05/27/2018      3,500,000
1162    H          02/21/2018      05/03/2018      1,500,000

I want to add a new column, No_Paid, after Invoice_Amount. For each invoice it should hold the "number of paid invoices prior to the Document Date of a new invoice of a customer", that is, how many of that customer's invoices were already cleared before the invoice was issued.

The expected output is as follows...

        Customer   Document Date   Clearing Date   Invoice_Amount No_Paid*
0       A          09/13/2016      11/04/2016      2,007,324          8 
1       A          04/18/2016      07/11/2016      631,714            1
2       A          09/13/2016      09/16/2016      4,000,000          8
3       A          07/11/2017      09/23/2017      5,000,000          6
4       A          05/03/2016      06/17/2016      2,000,000          1
---     ---        ---             ---             ---              ---
1158    H          04/21/2017      06/28/2017      3,000,000          5 
1159    H          04/25/2017      05/19/2017      1,000,000          3
1160    H          11/03/2017      12/11/2017      4,500,000          7
1161    H          03/15/2018      05/27/2018      3,500,000         37
1162    H          02/21/2018      05/03/2018      1,500,000         37

Currently I use a for loop to get the expected output:

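import pandas as pd

# Raw string so the backslash in the Windows path is not read as an escape
df = pd.read_csv(r'E:\data.csv')
df['Document Date'] = pd.to_datetime(df['Document Date'], format="%m/%d/%Y")
df['Clearing Date'] = pd.to_datetime(df['Clearing Date'], format="%m/%d/%Y")
df["No_Paid"] = 0
for i in df.index:
    # The sample data above calls this column "Customer"
    customer = df.loc[i, "Customer"]
    doc_date = df.loc[i, "Document Date"]
    # 180 days stands in for six months
    six_month = doc_date - pd.Timedelta(days=180)
    # Count this customer's invoices issued inside the window and cleared before doc_date
    df.loc[i, "No_Paid"] = df.loc[(df["Customer"] == customer) & (df["Clearing Date"] < doc_date) & (df["Document Date"] >= six_month), "Invoice_Amount"].count()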

In the real data I have over 100,000 invoices, so the loop takes a long time. I tried to use df.apply, but I couldn't get the same output.

Going by your example:

import pandas as pd
# read in csv (save as csv or read in using pd.read_excel);
# thousands=',' parses amounts such as "2,007,324" as numbers
df = pd.read_csv('file.csv', thousands=',')
# to datetime just in case
df['Doc_Date'] = pd.to_datetime(df['Doc_Date'])
df['Exp_Date'] = pd.to_datetime(df['Exp_Date'])
df['Overdue'] = df['Doc_Date'] - df['Exp_Date']
# 180 days for 6 months
df['6M_Age'] = df['Doc_Date'] - pd.Timedelta(days=180)
# Hard to tell what the line in the middle of the data means
# you can group by two columns if you need to
# cumsum needs a single numeric column; Invoice_Amount is assumed here from the question's sample
df['Sum_of_paid'] = df.groupby('ID')['Invoice_Amount'].cumsum()
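
If the goal is to reproduce the loop from the question without iterating row by row, one possible approach is to compare every invoice of a customer against every other invoice of that customer with NumPy broadcasting. The following is only a sketch under assumptions: the column names are taken from the question's sample table ("Customer", "Document Date", "Clearing Date"), the 180-day window comes from the question's loop, and add_no_paid and the demo frame are hypothetical names made up for illustration.

```python
import numpy as np
import pandas as pd


def add_no_paid(df: pd.DataFrame, window_days: int = 180) -> pd.DataFrame:
    """Return a copy of df with a No_Paid column (hypothetical helper)."""
    doc_all = df["Document Date"].to_numpy()
    clr_all = df["Clearing Date"].to_numpy()
    window = np.timedelta64(window_days, "D")
    counts = np.zeros(len(df), dtype=np.int64)

    # .indices maps each customer to the row positions of its invoices
    for positions in df.groupby("Customer").indices.values():
        doc = doc_all[positions]
        clr = clr_all[positions]
        # Row i is the invoice being scored, column j is a candidate invoice
        paid_before = clr[None, :] < doc[:, None]
        in_window = doc[None, :] >= (doc - window)[:, None]
        counts[positions] = (paid_before & in_window).sum(axis=1)

    return df.assign(No_Paid=counts)


# Small made-up frame, only to show the call
demo = pd.DataFrame({
    "Customer": ["A", "B", "A", "A", "B", "A"],
    "Document Date": pd.to_datetime(
        ["2016-01-01", "2016-03-15", "2016-03-01",
         "2016-05-01", "2016-04-01", "2017-01-01"]),
    "Clearing Date": pd.to_datetime(
        ["2016-02-01", "2016-03-20", "2016-04-01",
         "2016-06-01", "2016-05-01", "2017-02-01"]),
})
result = add_no_paid(demo)
print(result)
```

This builds one boolean matrix of size n x n per customer, so memory grows with the square of the largest customer's invoice count. For a customer with a very large number of invoices, processing the rows in chunks would probably be needed.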
