
Groupby with finding highest value in subset

I have data as follows:

In [16]: game_df.head(9)
Out[16]: 
   team_id  game_id game_date  w  l  wins  losses  winning%  
0        1        1  11/16/18  1  0    20      10  0.666667
1        1        3  11/18/18  0  1    20      11  0.645161
2        1        6  11/21/18  0  1    20      12  0.625000
3        2        4  11/19/18  1  0    16      14  0.533333
4        2        8  11/23/18  1  0    17      14  0.548387
5        2        9  11/24/18  0  1    17      15  0.531250
6        3        2  11/17/18  0  1    24       8  0.750000
7        3        5  11/20/18  1  0    25       8  0.757576
8        3        7  11/22/18  1  0    26       8  0.764706

What I need is to take the winning% column and, for each row, compute the difference between that row's winning% and the latest winning% observed for each team_id (including the row's own team), but keep only the difference against the largest of those latest values, i.e. the gap to whichever team has the best record at that point.

So I would want to get something like this back:

In [16]: game_df.head(9)
Out[16]: 
   team_id  game_id game_date  w  l  wins  losses  winning%    w%_bac
0        1        1  11/16/18  1  0    20      10  0.666667      --
1        1        3  11/18/18  0  1    20      11  0.645161  -0.10483
2        1        6  11/21/18  0  1    20      12  0.625000  -0.13257
3        2        4  11/19/18  1  0    16      14  0.533333  -0.21667
4        2        8  11/23/18  1  0    17      14  0.548387  -0.21632
5        2        9  11/24/18  0  1    17      15  0.531250  -0.23346
6        3        2  11/17/18  0  1    24       8  0.750000   0.00000
7        3        5  11/20/18  1  0    25       8  0.757576   0.00000
8        3        7  11/22/18  1  0    26       8  0.764706   0.00000

So in game 9 on 11/24/18 team 2 lost and its winning% fell from 0.548387 to 0.531250. It therefore fell further behind the other two teams, which at that point stood at 0.625000 (team #1) and 0.764706 (team #3). So the % back for team #2 would be 0.531250 - 0.764706 = -0.233456.

Finally, I need to calculate where each team_id would rank at that moment, i.e., on 11/24/18 the team_id ranking would be 3, 1, 2.
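
To illustrate the ranking part, here is a rough sketch of the kind of per-date standings I mean (assuming pandas; latest and standings are just names made up for this example):

import pandas as pd

# parse game_date so the pivoted rows sort chronologically
gdf = game_df.assign(game_date=pd.to_datetime(game_df['game_date'], format='%m/%d/%y'))

# each team's most recent winning% on every game date
latest = gdf.pivot(index='game_date', columns='team_id', values='winning%').ffill()

# rank the teams on each date, best record first
standings = latest.rank(axis=1, ascending=False, method='min')
print(standings.loc['2018-11-24'])  # team 3 -> 1.0, team 1 -> 2.0, team 2 -> 3.0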

thanks

import pandas as pd

df = df.sort_values(by='game_date')  # sort by game_date (string order is chronological for this sample)

# add a column with each team's latest winning%, forward-filled to every game date (but not backward)
for team_id in df['team_id'].unique():
    df[str(team_id) + 'win_%'] = (df.loc[df.team_id == team_id, ['winning%', 'game_date']]
                                    .set_index('game_date')
                                    .reindex(df.game_date)
                                    .sort_index()
                                    .ffill()
                                    .values)
# before a team's first game it has no winning% yet; treat those NaNs as 0
df = df.fillna(0)
# the gap to the leader is the most negative of the per-team differences
df['w%_bac'] = pd.concat([df['winning%'] - df['1win_%'],
                          df['winning%'] - df['2win_%'],
                          df['winning%'] - df['3win_%']], axis=1).min(axis=1)
# drop the helper columns
df = df.drop(columns=['1win_%', '2win_%', '3win_%'])

df

   team_id  game_id game_date  w  l  wins  losses  winning%  w%_bac
0        1        1  11/16/18  1  0    20      10     0.667   0.000
6        3        2  11/17/18  0  1    24       8     0.750   0.000
1        1        3  11/18/18  0  1    20      11     0.645  -0.105
3        2        4  11/19/18  1  0    16      14     0.533  -0.217
7        3        5  11/20/18  1  0    25       8     0.758   0.000
2        1        6  11/21/18  0  1    20      12     0.625  -0.133
8        3        7  11/22/18  1  0    26       8     0.765   0.000
4        2        8  11/23/18  1  0    17      14     0.548  -0.216
5        2        9  11/24/18  0  1    17      15     0.531  -0.233
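
The three helper columns above are hard-coded for team ids 1, 2 and 3. As a rough sketch only (assuming the same df and column names), the same gap to the league leader can be computed for any number of teams by pivoting each team's latest winning% per date and taking the row-wise maximum:

import pandas as pd

# parse game_date so the pivoted index sorts chronologically
df['game_date'] = pd.to_datetime(df['game_date'], format='%m/%d/%y')

# wide table: each team's most recent winning% on every game date
latest = df.pivot(index='game_date', columns='team_id', values='winning%').ffill()

# best record in the league on each date, mapped back onto every row
leader = latest.max(axis=1)
df['w%_bac'] = df['winning%'].values - leader.reindex(df['game_date']).values

This avoids creating and dropping one helper column per team and reproduces the same w%_bac values for this sample.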
