I'm facing a complex problem. I have a first dataframe containing customers (note that ClientID is not unique: the same ClientID can appear with different TestDate values):
df1 :
ClientID TestDate
1A 2019-12-24
1B 2019-08-26
1B 2020-01-12
I have another dataframe of "operations", indicating the date of each operation and which client is involved:
df2 :
LineNumber ClientID Date Amount
1 1A 2020-01-12 50
2 1A 2019-09-24 15
3 1A 2019-12-25 20
4 1A 2018-12-30 30
5 1B 2018-12-30 60
6 1B 2019-12-12 40
What I want is to add two columns to df1, the mean Amount and the number of rows, computed only over the df2 rows where Date < TestDate.
For example, for client 1A I'll only take LineNumber 2 and 4 (because the Date of lines 1 and 3 is later than TestDate), obtaining the following output for df1:
Expected df1 :
ClientID TestDate NumberOp MeanOp
1A 2019-12-24 2 22.5
1B 2019-08-26 1 60
1B 2020-01-12 2 50
Note: for the first row of client 1B, since the TestDate is 2019-08-26, only one operation qualifies (the LineNumber 6 operation was done on 2019-12-12, so AFTER TestDate, so I don't take it into account).
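For reference, the two example frames above can be reproduced like this (a setup sketch; the dates are parsed to datetime so that the Date < TestDate comparison works correctly):

```python
import pandas as pd

# df1: customers, ClientID not unique (same client can have several TestDates)
df1 = pd.DataFrame({
    'ClientID': ['1A', '1B', '1B'],
    'TestDate': pd.to_datetime(['2019-12-24', '2019-08-26', '2020-01-12']),
})

# df2: operations, each tied to a client and a date
df2 = pd.DataFrame({
    'LineNumber': [1, 2, 3, 4, 5, 6],
    'ClientID': ['1A', '1A', '1A', '1A', '1B', '1B'],
    'Date': pd.to_datetime(['2020-01-12', '2019-09-24', '2019-12-25',
                            '2018-12-30', '2018-12-30', '2019-12-12']),
    'Amount': [50, 15, 20, 30, 60, 40],
})
```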
I already have code that does this, but it uses iterrows on df1, which takes ages:
Current code (working but slow):
for index, row in df1.iterrows():
    client_id = row['ClientID']
    test_date = row['TestDate']
    # keep only this client's operations dated before TestDate
    df2_known = df2.loc[df2['ClientID'] == client_id]
    df2_known = df2_known.loc[df2_known['Date'] < test_date]
    df1.loc[index, 'NumberOp'] = df2_known.shape[0]
    df1.loc[index, 'MeanOp'] = df2_known['Amount'].mean()
I had the idea to use aggregates, with functions like mean and count, but having to filter by date for each row is the part I can't figure out. Many thanks in advance for any help.
Edit : Remaining issue :
The fix given in the answer's edit ("in case you want to preserve missing matching keys of df2") does not correspond to my issue.
In fact, I want to avoid losing the corresponding row of df1 when no operation in df2 can be used to compute the mean and count. I'll show the problem with an example:
df = df2.merge(df1, on=['ClientID'], how='right')
print(df[df['ClientID'] == '5C'])
Output :
ClientID TestDate Date Amount
5C 2019-12-12 2020-01-12 50
If I do the groupby and transform as given in the answer, my output will not have any row with ClientID == '5C', because neither Date < TestDate nor Date is null ever holds for it, so the row is lost at the filtering step:
df = df[(df['Date']<df['TestDate']) | (df['Date'].isnull())]
I want to keep a row with ClientID == '5C' in my final output, which would look like this:
ClientID TestDate NumberOp MeanOp
5C 2019-12-12 0 NaN
You can merge and transform:
df = df2.merge(df1, on=['ClientID'])
#filter based on condition
df = df[df['Date']<df['TestDate']]
#get the mean and count into new columns
df['MeanOp'] = df.groupby(['ClientID'])['Amount'].transform('mean')
df['NumberOp'] = df.groupby(['ClientID'])['Amount'].transform('count')
#drop duplicates and irrelevant columns
df = df.drop(['Amount','Date','LineNumber'], axis=1).drop_duplicates()
output:
ClientID TestDate MeanOp NumberOp
1 1A 2019-12-24 22.5 2
4 1B 2019-08-26 60.0 1
EDIT : in case you want to preserve df1 rows whose ClientID has no match in df2:
df = df2.merge(df1, on=['ClientID'], how='right')
df = df[(df['Date']<df['TestDate']) | (df['Date'].isnull())]
df['MeanOp'] = df.groupby(['ClientID'])['Amount'].transform('mean')
df['NumberOp'] = df.groupby(['ClientID'])['Amount'].transform('count')
df = df.drop(['Amount','Date','LineNumber'], axis=1).drop_duplicates()
example:
df1:
ClientID TestDate
0 1A 2019-12-24
1 1B 2019-08-26
2 1C 2019-08-26
output:
ClientID TestDate MeanOp NumberOp
1 1A 2019-12-24 22.5 2
4 1B 2019-08-26 60.0 1
5 1C 2019-08-26 NaN 0
UPDATE : based on the edit to the post, if you want to group by (ClientID, TestDate):
df = df2.merge(df1, on=['ClientID'], how='right')
df = df[(df['Date']<df['TestDate']) | (df['Date'].isnull())]
df['MeanOp'] = df.groupby(['ClientID','TestDate'])['Amount'].transform('mean')
df['NumberOp'] = df.groupby(['ClientID','TestDate'])['Amount'].transform('count')
df = df.drop(['Amount','Date','LineNumber'], axis=1).drop_duplicates()
output:
df1
ClientID TestDate
0 1A 2019-12-24
1 1B 2019-08-26
2 1B 2020-01-12
3 1C 2019-08-26
df2
LineNumber ClientID Date Amount
0 1 1A 2020-01-12 50
1 2 1A 2019-09-24 15
2 3 1A 2019-12-25 20
3 4 1A 2018-12-30 30
4 5 1B 2018-12-30 60
5 6 1B 2019-12-12 40
df
ClientID TestDate MeanOp NumberOp
1 1A 2019-12-24 22.5 2
4 1B 2019-08-26 60.0 1
6 1B 2020-01-12 50.0 2
8 1C 2019-08-26 NaN 0
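One gap remains, flagged in the question's edit: with this filter, a client like 5C whose only operations all fall on or after its TestDate loses its row entirely, because neither Date < TestDate nor Date.isnull() ever holds for it. A sketch of one way around this (not part of the original answer): aggregate first, then left-merge the results back onto df1 so every (ClientID, TestDate) row survives:

```python
import pandas as pd

df1 = pd.DataFrame({
    'ClientID': ['1A', '1B', '1B', '5C'],
    'TestDate': pd.to_datetime(['2019-12-24', '2019-08-26',
                                '2020-01-12', '2019-12-12']),
})
# 5C has a single operation, dated AFTER its TestDate
df2 = pd.DataFrame({
    'LineNumber': [1, 2, 3, 4, 5, 6, 7],
    'ClientID': ['1A', '1A', '1A', '1A', '1B', '1B', '5C'],
    'Date': pd.to_datetime(['2020-01-12', '2019-09-24', '2019-12-25',
                            '2018-12-30', '2018-12-30', '2019-12-12',
                            '2020-01-12']),
    'Amount': [50, 15, 20, 30, 60, 40, 50],
})

# Merge, keep only operations strictly before TestDate, aggregate per pair
df = df2.merge(df1, on='ClientID')
df = df[df['Date'] < df['TestDate']]
agg = (df.groupby(['ClientID', 'TestDate'])['Amount']
         .agg(NumberOp='count', MeanOp='mean')
         .reset_index())

# Left-merge onto df1: rows with no qualifying operation (like 5C) are kept,
# with NumberOp filled as 0 and MeanOp left as NaN
out = df1.merge(agg, on=['ClientID', 'TestDate'], how='left')
out['NumberOp'] = out['NumberOp'].fillna(0).astype(int)
```

This produces the same rows as the UPDATE above, plus a 5C row with NumberOp 0 and MeanOp NaN, matching the expected output in the question's edit.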