
Find the number of days since a max value

Given the following DataFrame:

+----+--------+------------+------+---------------------+
| id | player | match_date | stat | days_since_max_stat |
+----+--------+------------+------+---------------------+
|  1 |      1 | 2022-01-01 | 1500 | NaN                 |
|  2 |      1 | 2022-01-03 | 1600 | 2                   |
|  3 |      1 | 2022-01-10 | 2100 | 7                   |
|  4 |      1 | 2022-01-11 | 1800 | 1                   |
|  5 |      1 | 2022-01-18 | 1700 | 8                   |
|  6 |      2 | 2022-01-01 | 1600 | NaN                 |
|  7 |      2 | 2022-01-03 | 1800 | 2                   |
|  8 |      2 | 2022-01-10 | 1600 | 7                   |
|  9 |      2 | 2022-01-11 | 1900 | 8                   |
| 10 |      2 | 2022-01-18 | 1500 | 7                   |
+----+--------+------------+------+---------------------+

How would I calculate the days_since_max_stat column? The calculation is done per player and excludes the stat in the current row.

For example, the value for the row where id = 5 is 8 because the max stat among that player's earlier rows is in the row where id = 3, so days_since_max_stat = 2022-01-18 - 2022-01-10 = 8.

Here's the base DataFrame:

import datetime as dt
import pandas as pd


dates = [
    dt.datetime(2022, 1, 1),
    dt.datetime(2022, 1, 3),
    dt.datetime(2022, 1, 10),
    dt.datetime(2022, 1, 11),
    dt.datetime(2022, 1, 18),
]
df = pd.DataFrame(
    {
        "id": range(1, 11),
        "player": [1 for i in range(5)] + [2 for i in range(5)],
        "match_date": dates + dates,
        "stat": (1500, 1600, 2100, 1800, 1700, 1600, 1800, 1600, 1900, 1500)
    }
)
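
To make the rule concrete, here is a slow reference sketch that follows the description literally: for each row, look at the same player's earlier rows only, find the date of their max stat, and subtract. The function name days_since_max_reference is hypothetical, and resolving ties to the earliest date is an assumption, since the question does not say how ties should be handled. It is only meant as a baseline to compare faster solutions against.

```python
import datetime as dt

import pandas as pd

dates = [dt.datetime(2022, 1, d) for d in (1, 3, 10, 11, 18)]
df = pd.DataFrame(
    {
        "id": range(1, 11),
        "player": [1] * 5 + [2] * 5,
        "match_date": dates + dates,
        "stat": (1500, 1600, 2100, 1800, 1700, 1600, 1800, 1600, 1900, 1500),
    }
)


def days_since_max_reference(frame):
    """Hypothetical helper: row-by-row baseline, quadratic in the number of rows."""
    result = []
    for _, row in frame.iterrows():
        # Only the same player's strictly earlier matches count.
        earlier = frame[
            (frame["player"] == row["player"])
            & (frame["match_date"] < row["match_date"])
        ]
        if earlier.empty:
            result.append(float("nan"))
            continue
        # idxmax returns the first row holding the max (tie rule is an assumption).
        max_date = earlier.loc[earlier["stat"].idxmax(), "match_date"]
        result.append((row["match_date"] - max_date).days)
    return pd.Series(result, index=frame.index, dtype="float64")


expected = days_since_max_reference(df)
print(expected.tolist())
```

For the sample data this reproduces the days_since_max_stat column from the table in the question.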

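You can use a double groupby. The important part is to build a second grouping key that starts a new block right after each row that reaches the player's running max, so that a block collects the rows that follow the same max. After that, the answer is a cumulative sum of the day gaps within each group: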

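g = df.groupby(df['player'])
# days between consecutive matches of the same player
diff = g['match_date'].diff().dt.days
# block id: increases on the row after each row that reaches the player's running max
g2 = df['stat'].ge(g['stat'].cummax()).shift().cumsum()
# days since the last max: running total of the gaps within each (player, block)
df['dsms'] = diff.groupby([df['player'], g2]).cumsum()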

Output:

   id  player match_date  stat  dsms
0   1       1 2022-01-01  1500   NaN
1   2       1 2022-01-03  1600   2.0
2   3       1 2022-01-10  2100   7.0
3   4       1 2022-01-11  1800   1.0
4   5       1 2022-01-18  1700   8.0
5   6       2 2022-01-01  1600   NaN
6   7       2 2022-01-03  1800   2.0
7   8       2 2022-01-10  1600   7.0
8   9       2 2022-01-11  1900   8.0
9  10       2 2022-01-18  1500   7.0
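
To see why the second grouping key works, it can help to print the intermediate columns side by side. The sketch below is a slight variation on the code above, not the answer's exact code: it casts the boolean mask to int and uses shift(fill_value=0) so the key stays an integer, and the names gap_days, is_running_max and block are made up for illustration.

```python
import pandas as pd

match_dates = ["2022-01-01", "2022-01-03", "2022-01-10", "2022-01-11", "2022-01-18"]
df = pd.DataFrame(
    {
        "player": [1] * 5 + [2] * 5,
        "match_date": pd.to_datetime(match_dates * 2),
        "stat": [1500, 1600, 2100, 1800, 1700, 1600, 1800, 1600, 1900, 1500],
    }
)

by_player = df.groupby("player")
# Days between consecutive matches of the same player (NaN on each first match).
gap_days = by_player["match_date"].diff().dt.days
# True where the row reaches the player's running max.
is_running_max = df["stat"].ge(by_player["stat"].cummax())
# Shift by one row so a new block starts on the row AFTER each max.
block = is_running_max.astype(int).shift(fill_value=0).cumsum()
# Summing the gaps inside each (player, block) gives the days since the last max.
dsms = gap_days.groupby([df["player"], block]).cumsum()

print(
    pd.DataFrame(
        {
            "stat": df["stat"],
            "gap_days": gap_days,
            "is_running_max": is_running_max,
            "block": block,
            "dsms": dsms,
        }
    )
)
```

Rows 3 and 4 share a block because neither beats the 2100 set in row 2, so their gaps of 1 and 7 days accumulate to 1 and 8.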

First, imagine you have only one player. Then you can use expanding to find the cumulative max/idxmax, shift it by one row so the current row is excluded, and subtract:

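def day_since_max(data):
    # index label of the running max of 'stat' up to and including each row
    maxIdx = data['stat'].expanding().apply(pd.Series.idxmax)
    # date of that max, shifted one row so the current row's stat is excluded
    date_at_max = data.loc[maxIdx, 'match_date'].shift()
    # .values drops the index so the subtraction is positional
    return data['match_date'] - date_at_max.values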

Now, we can use groupby().apply to apply that function for each player:

df['days_since_max'] = df.groupby('player').apply(day_since_max).reset_index(level=0, drop=True)

Output:

   id  player match_date  stat days_since_max
0   1       1 2022-01-01  1500            NaT
1   2       1 2022-01-03  1600         2 days
2   3       1 2022-01-10  2100         7 days
3   4       1 2022-01-11  1800         1 days
4   5       1 2022-01-18  1700         8 days
5   6       2 2022-01-01  1600            NaT
6   7       2 2022-01-03  1800         2 days
7   8       2 2022-01-10  1600         7 days
8   9       2 2022-01-11  1900         8 days
9  10       2 2022-01-18  1500         7 days

Here's another solution that is very similar to @QuangHoang's solution, only instead of applying a function to get the values, it uses transform.

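(i) Group by "player" and, in an expanding window, find the running max "stat" for each player.

(ii) Group the result of (i) by "player" and "stat", and use transform with idxmax to broadcast to every row the index of the row where its running max was first reached.

(iii) Look up "match_date" in df at the indices from (ii), then group by "player" again and shift by one row, so that each row sees the date of the max over that player's earlier rows only.

(iv) Subtract that date from the row's own "match_date" to get the difference in days.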

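# (i) running max of 'stat' per player
exp_max = df.groupby('player')['stat'].expanding().max().reset_index(level=0)
# (ii) index of the row where each running max was first reached
idxmax = exp_max.groupby(['player','stat'])['stat'].transform('idxmax')
# (iii) date of that row, shifted one row within each player
previous_max = df.loc[idxmax, 'match_date'].groupby(df['player']).shift().reset_index(drop=True)
# (iv) difference in days
df['days_since_max'] = df['match_date'] - previous_max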

Output:

   id  player match_date  stat days_since_max
0   1       1 2022-01-01  1500            NaT
1   2       1 2022-01-03  1600         2 days
2   3       1 2022-01-10  2100         7 days
3   4       1 2022-01-11  1800         1 days
4   5       1 2022-01-18  1700         8 days
5   6       2 2022-01-01  1600            NaT
6   7       2 2022-01-03  1800         2 days
7   8       2 2022-01-10  1600         7 days
8   9       2 2022-01-11  1900         8 days
9  10       2 2022-01-18  1500         7 days
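
The two outputs above hold Timedelta values ("2 days", NaT), whereas the table in the question shows plain numbers. A minimal sketch of the conversion, assuming the column has a timedelta dtype, uses the .dt.days accessor. The small Series below are made up for illustration and stand in for match_date and the shifted date of the previous max.

```python
import pandas as pd

match_date = pd.Series(pd.to_datetime(["2022-01-01", "2022-01-03", "2022-01-10"]))
# Hypothetical stand-in for the date of the previous max (missing for the first row).
previous_max_date = pd.Series(pd.to_datetime([None, "2022-01-01", "2022-01-03"]))

delta = match_date - previous_max_date  # timedelta64 Series, NaT in the first row
days = delta.dt.days                    # numeric days, NaN where the delta is NaT

print(days.tolist())
```

Applied to the real frame, this would be df['days_since_max'].dt.days.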


 