
Fastest way to do cumulative totals in Pandas dataframe

I've got a pandas dataframe of golfers' round scores going back to 2003 (approx. 300,000 rows). It looks something like this:

Date        Golfer          Tournament              Score  Player Total Rounds Played
2008-01-01  Tiger Woods     Invented Tournament R1  72     50
2008-01-01  Phil Mickelson  Invented Tournament R1  73     108

I want the 'Player Total Rounds Played' column to be a running total of the number of rounds (i.e. rows in the dataframe) that a player has played up to that date. Is there a quick way of doing it? My current solution (basically using iterrows and then a one-line function) works fine but will take approx. 11 hours to run.

Thanks,

Tom

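Here is one way, assuming the values to accumulate are in a column named 'Rounds' (as in the example below). Sort by date, then take a cumulative sum within each golfer:

# Sort chronologically so the running total follows date order
df = df.sort_values('Date')
# Cumulative sum of 'Rounds' within each golfer
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()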

For example:

import pandas as pd

df = pd.DataFrame([['A', 70, 50],
                   ['B', 72, 55],
                   ['A', 73, 45],
                   ['A', 71, 60],
                   ['B', 74, 55],
                   ['A', 72, 65]],
                  columns=['Golfer', 'Rounds', 'Played'])

# Running total of 'Rounds' within each golfer, in row order
df['Rounds CumSum'] = df.groupby('Golfer')['Rounds'].cumsum()

#   Golfer  Rounds  Played  Rounds CumSum
# 0      A      70      50             70
# 1      B      72      55             72
# 2      A      73      45            143
# 3      A      71      60            214
# 4      B      74      55            146
# 5      A      72      65            286
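
The question asks for a running count of rows per player rather than a running sum of a column. A minimal sketch of that variant, assuming the column names from the question ('Date', 'Golfer', 'Score') and made-up sample rows, could use groupby(...).cumcount(), which numbers the rows within each group starting from 0:

```python
import pandas as pd

# Hypothetical sample rows; only the column names come from the question.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2008-01-08', '2008-01-01', '2008-01-01',
                            '2008-01-02', '2008-01-02']),
    'Golfer': ['Tiger Woods', 'Tiger Woods', 'Phil Mickelson',
               'Tiger Woods', 'Phil Mickelson'],
    'Score': [70, 72, 73, 69, 71],
})

# Sort chronologically (mergesort is stable, so same-day rows keep their order).
df = df.sort_values('Date', kind='mergesort').reset_index(drop=True)

# cumcount() numbers each golfer's rows 0, 1, 2, ...;
# adding 1 makes the count include the current round.
df['Player Total Rounds Played'] = df.groupby('Golfer').cumcount() + 1

print(df)
```

Because the count is computed in one vectorized pass per group, it avoids the Python-level loop that iterrows performs. If the count should exclude the current round, drop the + 1.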


 