
Pandas: Group by, Cumsum + Shift with a “where clause”

I am trying to learn how to do in Pandas things I would otherwise do with SQL window functions.

Assume I have the following dataframe, which shows different players' previous matches and how many kills they got in each match.

 date          player        kills 
 2019-01-01      a             15
 2019-01-02      b             20
 2019-01-03      a             10
 2019-03-04      a             20

With the code below I managed to create a groupby that shows the previous summed kill values (the sum of the player's kills, excluding the kills he got in the match of the current row).

df['sum_kills'] = df.groupby('player')['kills'].transform(lambda x: x.cumsum().shift())

This creates the following values:

 date          player        kills    sum_kills
 2019-01-01      a             15      NaN
 2019-01-02      b             20      NaN
 2019-01-03      a             10      15
 2019-03-04      a             20      25
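For reference, the sample data can be built like this (a minimal sketch; the exact construction is my assumption of how the table above is created):

import pandas as pd

# build the sample data; dates are parsed so that date arithmetic works later
df = pd.DataFrame({
    'date': pd.to_datetime(['2019-01-01', '2019-01-02', '2019-01-03', '2019-03-04']),
    'player': ['a', 'b', 'a', 'a'],
    'kills': [15, 20, 10, 20],
})

# cumulative kills per player, shifted so the current match is excluded
df['sum_kills'] = df.groupby('player')['kills'].transform(lambda x: x.cumsum().shift())
print(df)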

However, what I ideally want is the option to include a filter/where clause in the grouped values. So let's say I only wanted the summed values from the previous 30 days (1 month). My new dataframe should then look like this:

 date          player        kills    sum_kills
 2019-01-01      a             15      NaN
 2019-01-02      b             20      NaN
 2019-01-03      a             10      15
 2019-03-04      a             20      NaN

The last row has no summed kills (NaN) because player a played no games within the previous 30 days. Is this possible somehow?

I think you are a bit in a pinch using groupby and transform. As explained here, transform operates on a single Series, so you can't access the data of other columns.
groupby with apply does not seem to be the right approach either, because the custom function is expected to return an aggregated result for the group passed by groupby, but you want a different result for each row.

So the best solution I can propose is to use apply without groupby, and do all the selection yourself inside the custom function:

def killcount(x, data, timewin):
    """Count the player's kills in a time window before the current row's date.

    x: dataframe row
    data: full dataframe
    timewin: a pandas.Timedelta
    """
    mask = (
        (data['date'] < x['date'])               # dates strictly before the current row
        & (data['date'] >= x['date'] - timewin)  # dates within the time window
        & (data['player'] == x['player'])        # rows for the same player
    )
    return data.loc[mask, 'kills'].sum()

df['sum_kills'] = df.apply(lambda r : killcount(r, df, pd.Timedelta(30, 'D')), axis=1)

This returns:

        date player  kills  sum_kills
0 2019-01-01      a     15          0
1 2019-01-02      b     20          0
2 2019-01-03      a     10         15
3 2019-03-04      a     20          0
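Note that the desired output in the question shows NaN for rows with no prior games, whereas the sum of an empty selection is 0, which is why the first, second and last rows above show 0. If NaN is preferred, one option (a sketch assuming pandas >= 0.22, where Series.sum accepts min_count; the name killcount_nan is just for illustration) is:

def killcount_nan(x, data, timewin):
    """Like killcount, but returns NaN when no prior matches fall in the window."""
    mask = (
        (data['date'] < x['date'])
        & (data['date'] >= x['date'] - timewin)
        & (data['player'] == x['player'])
    )
    # min_count=1 makes the sum of an empty selection NaN instead of 0
    return data.loc[mask, 'kills'].sum(min_count=1)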

In case you haven't done so yet, remember to parse the 'date' column to a datetime type using pandas.to_datetime, otherwise you cannot perform the date comparisons.
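For example, a minimal sketch (assuming the column is currently stored as strings):

import pandas as pd

# parse the 'date' column so that comparisons and Timedelta arithmetic work
df['date'] = pd.to_datetime(df['date'])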
