Filter data based on month and ID and sum in Pandas

ID  Email     Amount  Date
1   wi@gn.c.  20      26-11-19 12.06.36.726000
2   wi@gn.c.  40      26-12-19 12.06.37.293000
3   by@gn.c.  50      26-11-19 12.06.37.960000
4   wi@gn.c.  20      26-01-20 12.06.51.306000
5   wi@gn.c.  60      26-02-20 12.06.52.458000
6   by@gn.c.  15      26-08-19 12.06.58.397000
7   wi@gn.c.  37      26-12-19 12.07.00.191000
5   wi@gn.c.  60      26-02-20 12.06.52.458000
6   by@gn.c.  15      26-08-19 12.06.58.397000
7   wi@gn.c.  37      26-12-19 12.07.00.191000

I need to get the total amount for each email address for the past 1 month, 3 months and 6 months. I have tried several combinations of commands, but I am lost now.

In another answer, df.groupby('Email')['Amount'].sum().reset_index() works, but I need separate sums for the past 1 month, 3 months and 6 months.

The expected result will look like this

ID  Email     Total for past 1 Month  Total for past 3 Month  Total for past 6 Month
1   wi@gn.c.  20                      40                      60
3   by@gn.c.  50                      50                      100

NB: the final figures are not exactly correct, I am just trying to paint a picture of what I am trying to do.

Hope this helps: First convert your 'Date' column to a DatetimeIndex. Then split your data into windows covering the past 1 month, 3 months and 6 months, which gives 3 dfs. Aggregate each of these 3 dfs by the sum of 'Amount'. Finally, merge all 3 dfs on the 'Email' column.

import numpy as np
import pandas as pd

df = pd.DataFrame([[1,'wi@gn.c.',20,'26-11-19 12.06.36.726000'],
                   [2,'wi@gn.c.',40,'26-12-19 12.06.37.293000'],
                   [3,'by@gn.c.',50,'26-11-19 12.06.37.960000'],
                   [4,'wi@gn.c.',20,'26-01-20 12.06.51.306000'],
                   [5,'wi@gn.c.',60,'26-02-20 12.06.52.458000'],
                   [6,'by@gn.c.',15,'26-08-19 12.06.58.397000'],
                   [7,'wi@gn.c.',37,'26-12-19 12.07.00.191000'],
                   [6,'wi@gn.c.',60,'26-02-20 12.06.52.458000'],
                   [7,'by@gn.c.',15,'26-08-19 12.06.58.397000'],
                   [8,'wi@gn.c.',37,'26-12-19 12.07.00.191000']],
                  columns=['ID','Email','Amount','Date'])

# convert your 'Date' to datetimeindex
df['Date'] = pd.to_datetime(df['Date'], format = '%d-%m-%y %H.%M.%S.%f')
df.set_index('Date', inplace=True)
df.sort_index(inplace=True)

# create dfs from base df for past 1 month, 3 months and 6 months data and aggregate by sum of 'Amount'
# pd.datetime was removed from pandas, so use pd.Timestamp.now() for the baseline date.
# The output shown below corresponds to a run on about 11 March 2020;
# set end = pd.Timestamp('2020-03-11') to reproduce it.
end = pd.Timestamp.now()
df_1mo = df.loc[end - pd.DateOffset(months=1): end].groupby('Email')['Amount'].agg(total_1mo='sum')
df_3mo = df.loc[end - pd.DateOffset(months=3): end].groupby('Email')['Amount'].agg(total_3mo='sum')
df_6mo = df.loc[end - pd.DateOffset(months=6): end].groupby('Email')['Amount'].agg(total_6mo='sum')

# merge all 3 dfs on 'Email'
print(df_1mo.merge(df_3mo, on='Email', how='outer').merge(df_6mo, on='Email', how='outer').fillna(0))

Output:

          total_1mo  total_3mo  total_6mo
Email                                    
wi@gn.c.      120.0      254.0        274
by@gn.c.        0.0        0.0         50

  • In the last 1 month (Feb 11 to Mar 11) there are only 2 rows, both dated 02/26/2020 with Email wi@gn.c., so the sum of Amount is 60+60=120.
  • In the last 3 months (Dec 11 to Mar 11) there are 6 rows dated 02/26/2020, 01/26/2020 and 12/26/2019, all with Email wi@gn.c., so the sum of Amount is 60+60+20+37+37+40=254.
  • In the last 6 months (Sep 11 to Mar 11) there are 8 rows dated 02/26/2020, 01/26/2020, 12/26/2019 and 11/26/2019. One of these rows has Email by@gn.c. and Amount 50. The other 7 rows have Email wi@gn.c. and the sum of Amount is 60+60+20+37+37+40+20=274.
  • The other 2 rows, dated 08/26/2019, fall outside this 6-month range, so they are excluded.

Hope this explains the answer. You can set end to a different date to use another baseline date. Here I have used the current date as the baseline.

There may be a more efficient solution, but this should work for your sample dataset. Let me know how it goes.
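As an illustration of changing the baseline date, here is a minimal sketch that wraps the same filter-and-sum steps in a loop. The helper name window_totals and its months parameter are hypothetical, not part of the answer above, and the baseline is fixed at 11 March 2020 on the assumption that this is when the output above was produced. It filters on the 'Date' column directly instead of using a DatetimeIndex.

```python
import pandas as pd

# Same sample rows as above (IDs omitted, since they do not affect the totals)
rows = [
    ('wi@gn.c.', 20, '26-11-19 12.06.36.726000'),
    ('wi@gn.c.', 40, '26-12-19 12.06.37.293000'),
    ('by@gn.c.', 50, '26-11-19 12.06.37.960000'),
    ('wi@gn.c.', 20, '26-01-20 12.06.51.306000'),
    ('wi@gn.c.', 60, '26-02-20 12.06.52.458000'),
    ('by@gn.c.', 15, '26-08-19 12.06.58.397000'),
    ('wi@gn.c.', 37, '26-12-19 12.07.00.191000'),
    ('wi@gn.c.', 60, '26-02-20 12.06.52.458000'),
    ('by@gn.c.', 15, '26-08-19 12.06.58.397000'),
    ('wi@gn.c.', 37, '26-12-19 12.07.00.191000'),
]
df = pd.DataFrame(rows, columns=['Email', 'Amount', 'Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%y %H.%M.%S.%f')


def window_totals(df, end, months=(1, 3, 6)):
    """Sum 'Amount' per 'Email' for each look-back window ending at `end`.

    Hypothetical helper for illustration only.
    """
    end = pd.Timestamp(end)
    totals = {}
    for m in months:
        start = end - pd.DateOffset(months=m)
        in_window = df[(df['Date'] >= start) & (df['Date'] <= end)]
        totals[f'total_{m}mo'] = in_window.groupby('Email')['Amount'].sum()
    # Align the per-window Series on Email; emails absent from a window get 0
    return pd.concat(totals, axis=1).fillna(0).astype(int)


# Assumed baseline date, chosen to match the output shown above
result = window_totals(df, end='2020-03-11')
print(result)
```

With this fixed baseline the totals should match the output above: 120, 254 and 274 for wi@gn.c., and 0, 0 and 50 for by@gn.c. Passing a different end value moves all three windows at once.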

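Update: for the min or max, swap the aggregation function. Shown here with max; use 'min' in the same way for the minimum:

# column names are kept from the sum example; the values are now the maximum Amount in each window
df_1mo = df.loc[end - pd.DateOffset(months=1): end].groupby('Email')['Amount'].agg(total_1mo='max')
df_3mo = df.loc[end - pd.DateOffset(months=3): end].groupby('Email')['Amount'].agg(total_3mo='max')
df_6mo = df.loc[end - pd.DateOffset(months=6): end].groupby('Email')['Amount'].agg(total_6mo='max')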

# merge all 3 dfs on 'Email'
print(df_1mo.merge(df_3mo, on='Email', how='outer').merge(df_6mo, on='Email', how='outer').fillna(0))

Output:

          total_1mo  total_3mo  total_6mo
Email                                    
wi@gn.c.       60.0       60.0         60
by@gn.c.        0.0        0.0         50
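
If the minimum and maximum are both needed, one option is named aggregation, which returns several statistics per email in a single call. A minimal sketch for the 6-month window only, assuming a fixed baseline of 11 March 2020 (chosen to match the output above) and hypothetical column names min_6mo and max_6mo:

```python
import pandas as pd

# Same sample rows as above (IDs omitted, since they do not affect the result)
rows = [
    ('wi@gn.c.', 20, '26-11-19 12.06.36.726000'),
    ('wi@gn.c.', 40, '26-12-19 12.06.37.293000'),
    ('by@gn.c.', 50, '26-11-19 12.06.37.960000'),
    ('wi@gn.c.', 20, '26-01-20 12.06.51.306000'),
    ('wi@gn.c.', 60, '26-02-20 12.06.52.458000'),
    ('by@gn.c.', 15, '26-08-19 12.06.58.397000'),
    ('wi@gn.c.', 37, '26-12-19 12.07.00.191000'),
    ('wi@gn.c.', 60, '26-02-20 12.06.52.458000'),
    ('by@gn.c.', 15, '26-08-19 12.06.58.397000'),
    ('wi@gn.c.', 37, '26-12-19 12.07.00.191000'),
]
df = pd.DataFrame(rows, columns=['Email', 'Amount', 'Date'])
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%y %H.%M.%S.%f')

# Assumed baseline date; replace with pd.Timestamp.now() for the current date
end = pd.Timestamp('2020-03-11')
start = end - pd.DateOffset(months=6)

# Keep only rows inside the 6-month window (both endpoints inclusive)
recent = df[df['Date'].between(start, end)]

# Named aggregation: one output column per keyword argument
stats = recent.groupby('Email')['Amount'].agg(
    min_6mo='min',
    max_6mo='max',
    total_6mo='sum',
)
print(stats)
```

The same pattern applies to the 1-month and 3-month windows by changing the months value in the offset.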
