Pandas column where each value depends on another df query

I'm facing a tricky problem. I have a first dataframe containing customers (note that ClientID is not unique: the same ClientID can appear with different TestDate values):

df1:

ClientID  TestDate
1A        2019-12-24
1B        2019-08-26
1B        2020-01-12

I have another dataframe of "operations", indicating the date of each operation and which client is involved:

df2:

LineNumber  ClientID  Date          Amount
1           1A        2020-01-12    50
2           1A        2019-09-24    15
3           1A        2019-12-25    20
4           1A        2018-12-30    30
5           1B        2018-12-30    60
6           1B        2019-12-12    40

What I want is to add two columns to df1, the mean Amount and the number of rows, computed only over the df2 rows where Date < TestDate.

For example, for client 1A, I only take LineNumber 2 and 4 (because the Date of lines 1 and 3 is later than the TestDate), which gives the following output for df1:

Expected df1:

ClientID  TestDate      NumberOp  MeanOp
1A        2019-12-24    2         22.5
1B        2019-08-26    1         60
1B        2020-01-12    2         50

Note: for the first row of client 1B, since the TestDate is 2019-08-26, only one operation qualifies (the LineNumber 6 operation is dated 2019-12-12, AFTER the TestDate, so I don't take it into account).

I already have code that does this, but it has to iterate over df1 with iterrows, which takes ages:

Current code (working but slow):

for index, row in df1.iterrows():
    client_id = row['ClientID']   # renamed from 'id', which shadows the Python builtin
    test_date = row['TestDate']
    # operations of this client that happened strictly before the test date
    df2_known = df2[(df2['ClientID'] == client_id) & (df2['Date'] < test_date)]
    df1.loc[index, 'NumberOp'] = df2_known.shape[0]
    df1.loc[index, 'MeanOp'] = df2_known['Amount'].mean()

I had the idea of using aggregations such as mean and count, but having to filter df2 by a different date for each row of df1 is the part I can't figure out. Many thanks in advance for any help.
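
For reference, the two frames above can be rebuilt like this (a minimal setup snippet, assuming the date columns are real datetimes rather than strings):

import pandas as pd

df1 = pd.DataFrame({
    'ClientID': ['1A', '1B', '1B'],
    'TestDate': pd.to_datetime(['2019-12-24', '2019-08-26', '2020-01-12']),
})

df2 = pd.DataFrame({
    'LineNumber': [1, 2, 3, 4, 5, 6],
    'ClientID': ['1A', '1A', '1A', '1A', '1B', '1B'],
    'Date': pd.to_datetime(['2020-01-12', '2019-09-24', '2019-12-25',
                            '2018-12-30', '2018-12-30', '2019-12-12']),
    'Amount': [50, 15, 20, 30, 60, 40],
})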

Edit: Remaining issue:

The fix given in the answer's edit ("in case you want to preserve missing matching keys of df2") does not correspond to my issue.

In fact, I want to avoid losing a row of df1 when no operation in df2 can be used to compute the mean and count. Let me show the problem with an example:

df = df2.merge(df1, on=['ClientID'], how='right')
print(df[df['ClientID'] == '5C'])

Output:
ClientID  TestDate    Date          Amount
5C        2019-12-12  2020-01-12    50     

If I do the groupby and transform as given in the answer, my output will not have any row with ClientID == '5C', because neither Date < TestDate nor Date is null ever holds for it, so the line is lost at df = df[(df['Date']<df['TestDate']) | (df['Date'].isnull())]. I still want a row with ClientID == '5C' in my final output, which would look like this:

ClientID  TestDate      NumberOp  MeanOp
5C        2019-12-12    0         NaN
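
A sketch of one way to get that behaviour, building on the transform approach from the answer below (my own assumption, not taken from the answer): keep every merged row, but blank out Amount wherever the operation fails the date test, so the count becomes 0 and the mean NaN instead of the row disappearing:

df = df2.merge(df1, on=['ClientID'], how='right')
# keep all rows; void Amount where the operation is not strictly before TestDate
df['Amount'] = df['Amount'].where(df['Date'] < df['TestDate'])
df['MeanOp'] = df.groupby(['ClientID','TestDate'])['Amount'].transform('mean')
df['NumberOp'] = df.groupby(['ClientID','TestDate'])['Amount'].transform('count')
df = df.drop(columns=['Amount','Date','LineNumber']).drop_duplicates()

count ignores NaN, so a client like 5C keeps exactly one row, with NumberOp == 0 and MeanOp == NaN.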

You can merge and transform:

df = df2.merge(df1, on=['ClientID'])
#filter based on condition
df = df[df['Date']<df['TestDate']]
#get the mean and count into new columns
df['MeanOp'] = df.groupby(['ClientID'])['Amount'].transform('mean')
df['NumberOp'] = df.groupby(['ClientID'])['Amount'].transform('count')
#drop duplicates and irrelevant columns
df = df.drop(columns=['Amount','Date','LineNumber']).drop_duplicates()

output:

  ClientID    TestDate  MeanOp  NumberOp
1       1A  2019-12-24    22.5         2
4       1B  2019-08-26    60.0         1

EDIT: in case you want to preserve missing matching keys of df2:

df = df2.merge(df1, on=['ClientID'], how='right')
df = df[(df['Date']<df['TestDate']) | (df['Date'].isnull())]
df['MeanOp'] = df.groupby(['ClientID'])['Amount'].transform('mean')
df['NumberOp'] = df.groupby(['ClientID'])['Amount'].transform('count')
df = df.drop(columns=['Amount','Date','LineNumber']).drop_duplicates()

example:

df1:

  ClientID    TestDate
0       1A  2019-12-24
1       1B  2019-08-26
2       1C  2019-08-26

output:

  ClientID    TestDate  MeanOp  NumberOp
1       1A  2019-12-24    22.5         2
4       1B  2019-08-26    60.0         1
6       1C  2019-08-26     NaN         0

UPDATE: based on the edit to the post, if you want to group by (ClientID, TestDate):

df = df2.merge(df1, on=['ClientID'], how='right')
df = df[(df['Date']<df['TestDate']) | (df['Date'].isnull())]
df['MeanOp'] = df.groupby(['ClientID','TestDate'])['Amount'].transform('mean')
df['NumberOp'] = df.groupby(['ClientID','TestDate'])['Amount'].transform('count')
df = df.drop(columns=['Amount','Date','LineNumber']).drop_duplicates()

output:

df1
  ClientID    TestDate
0       1A  2019-12-24
1       1B  2019-08-26
2       1B  2020-01-12
3       1C  2019-08-26

df2
   LineNumber ClientID        Date  Amount
0           1       1A  2020-01-12      50
1           2       1A  2019-09-24      15
2           3       1A  2019-12-25      20
3           4       1A  2018-12-30      30
4           5       1B  2018-12-30      60
5           6       1B  2019-12-12      40

df
  ClientID    TestDate  MeanOp  NumberOp
1       1A  2019-12-24    22.5         2
4       1B  2019-08-26    60.0         1
6       1B  2020-01-12    50.0         2
8       1C  2019-08-26     NaN         0
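
As a side note, an equivalent formulation (my own sketch, not part of the answer above) computes the statistics once with groupby/agg and left-merges them back onto df1; rows of df1 with no qualifying operation, such as 1C, survive the left merge without needing the isnull() check:

# inner merge: only clients present in both frames
merged = df2.merge(df1, on='ClientID')
stats = (merged[merged['Date'] < merged['TestDate']]
         .groupby(['ClientID', 'TestDate'])['Amount']
         .agg(NumberOp='count', MeanOp='mean')
         .reset_index())
# left merge restores every row of df1; clients with no stats get NaN
result = df1.merge(stats, on=['ClientID', 'TestDate'], how='left')
result['NumberOp'] = result['NumberOp'].fillna(0).astype(int)

Here the left merge plays the role of how='right' above, and each statistic is computed once per group instead of being broadcast with transform and deduplicated afterwards.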
