
Is there a faster way to make changes to pandas groups than a for loop

I have the dataframe below that I am working with:

These are chess games, which I am trying to group by game so I can perform a function on each game based on the number of moves played in that game...

        game_id     move_number colour  avg_centi
0       03gDhPWr    1           white   NaN
1       03gDhPWr    2           black   37.0
2       03gDhPWr    3           white   61.0
3       03gDhPWr    4           black   -5.0
4       03gDhPWr    5           white   26.0
5       03gDhPWr    6           black   31.0
6       03gDhPWr    7           white   -2.0
... ... ... ... ...
110091  zzaiRa7s    34          black   NaN
110092  zzaiRa7s    35          white   NaN
110093  zzaiRa7s    36          black   NaN
110094  zzaiRa7s    37          white   NaN
110095  zzaiRa7s    38          black   NaN
110096  zzaiRa7s    39          white   NaN
110097  zzaiRa7s    40          black   NaN

Specifically I am using pd.cut to create a new column, phase , which lists whether the given move was played in the opening, middlegame, or endgame.

     game_id  move_number colour  avg_centi    phase
0   03gDhPWr            1  white        NaN  opening
1   03gDhPWr            2  black       37.0  opening
2   03gDhPWr            3  white       61.0  opening
3   03gDhPWr            4  black       -5.0  opening
4   03gDhPWr            5  white       26.0  opening
5   03gDhPWr            6  black       31.0  opening
6   03gDhPWr            7  white       -2.0  opening
..       ...          ...    ...        ...      ...
54  03gDhPWr           55  white       58.0  endgame
55  03gDhPWr           56  black       26.0  endgame
56  03gDhPWr           57  white      116.0  endgame
57  03gDhPWr           58  black     2000.0  endgame
58  03gDhPWr           59  white        0.0  endgame
59  03gDhPWr           60  black        0.0  endgame
60  03gDhPWr           61  white        NaN  endgame

I'm using the following code to achieve this. Note that each game must be partitioned into opening , middlegame , and endgame bins based on the total number of moves played in that game.

for game_id, group in df.groupby('game_id'):
    bins = (0, round(group['move_number'].max() * 1/3), round(group['move_number'].max() * 2/3), 
            group['move_number'].max())
    phases = ["opening", "middlegame", "endgame"]
    try:
        group.loc[:, 'phase'] = pd.cut(group['move_number'], bins, labels=phases)
    except:
        group.loc[:, 'phase'] = None
    print(group)

The problem is that iterating through every single one of thousands of games this way takes forever.

I am thinking that there must be a faster way to calculate this than using a for loop to iterate through the groups and perform the calculation one by one.

Here is a method I came up with using a simple example.

To summarize, 3 steps:

  1. find the max move number of each game using groupby
  2. merge that max back onto the original df
  3. assign the phase for all games at once by comparing move number to max move number

My method is in test1() while yours is in test2() :

import pandas
import random
import time

a = []

# 25 synthetic games ('A'..'Y'), each with 900-1000 rows numbered from 0.
for group in range(25):
    for count in range(random.randint(900, 1000)):
        a.append({'group': chr(65 + group), 'count': count})


def test1(x):
    b = pandas.DataFrame(x)

    # Step 1: per-group maximum of 'count'.
    max_df = b.groupby(by='group', as_index=False)['count'].max().rename(columns={'count': 'max'})

    # Step 2: broadcast the maximum back onto every row via a merge.
    b = pandas.merge(b, max_df, on='group', how='left')

    # Step 3: assign all phases at once by comparing each count to thirds of the max.
    b['phase'] = 'opening'
    b.loc[b['count'] > b['max'] / 3.0, 'phase'] = 'middlegame'
    b.loc[b['count'] > b['max'] / 1.5, 'phase'] = 'endgame'
    b.drop('max', axis=1, inplace=True)
    return b


def test2(x):
    df = pandas.DataFrame(x)
    df['phase'] = ''
    for game_id, group in df.groupby('group'):
        bins = (0, round(group['count'].max() * 1 / 3), round(group['count'].max() * 2 / 3),
                group['count'].max())
        phases = ["opening", "middlegame", "endgame"]
        try:
            group.loc[:, 'phase'] = pandas.cut(group['count'], bins, labels=phases)
        except:
            group.loc[:, 'phase'] = None
    return df


start_time = time.time()
out1 = test1(a)
print(time.time() - start_time)

start_time = time.time()
out2 = test2(a)
print(time.time() - start_time)

# This assert fails as written: test2 leaves the 'phase' column empty (see below).
# assert out1.to_dict() == out2.to_dict()

test1 is a lot faster than test2 , though this is only 1 run:

test1: 0.09799647331237793
test2: 0.769993782043457

And test2() seems to have an issue: it doesn't actually modify the dataframe, so the phase column stays empty. Not sure if it worked for you.
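
For reference, here is a minimal sketch of the same merge-based idea written against the original column names from the question ( game_id , move_number ); the bin edges mirror test1 and the labels mirror the question, but I haven't run it on your data:

import pandas as pd

# Assumes df has the 'game_id' and 'move_number' columns from the question.
# Step 1: per-game maximum move number.
max_df = (df.groupby('game_id', as_index=False)['move_number']
            .max()
            .rename(columns={'move_number': 'max_move'}))

# Step 2: broadcast the maximum back onto every row.
df = df.merge(max_df, on='game_id', how='left')

# Step 3: assign all phases at once by comparing each move to thirds of the max.
df['phase'] = 'opening'
df.loc[df['move_number'] > df['max_move'] / 3.0, 'phase'] = 'middlegame'
df.loc[df['move_number'] > df['max_move'] * 2.0 / 3.0, 'phase'] = 'endgame'
df = df.drop(columns='max_move')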

Here is an attempt using apply :

def split_by_third(game):
    # Fraction of the game completed by each move; assumes move numbers are
    # contiguous and start at 1, so len(game) equals the maximum move number.
    game_length = len(game)
    game = game.assign(phase_num=game['move_number']/game_length)

    return game

def assign_phase(row):
    # elif/else so every phase_num maps to a phase, including boundary values.
    if row['phase_num'] <= 0.34:
        return 'Beginning'
    elif row['phase_num'] <= 0.66:
        return 'Middle'
    else:
        return 'End'

df_grouped = df.groupby('game_id').apply(split_by_third)

df_grouped['phase'] = df_grouped.apply(assign_phase, axis=1)
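
As a side note (my own sketch, not part of this answer), the second, row-wise apply can itself be vectorized with pd.cut on the precomputed phase_num column; this assumes df_grouped from above and uses thirds as the cut points instead of 0.34/0.66:

import pandas as pd

# Assumes df_grouped already has the 'phase_num' fraction from split_by_third;
# the open-ended last bin guards against fractions slightly above 1.
df_grouped['phase'] = pd.cut(
    df_grouped['phase_num'],
    bins=[0, 1/3, 2/3, float('inf')],
    labels=['Beginning', 'Middle', 'End'],
)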

I was able to get it to work with cleaner and faster code using groupby.apply , as suggested by @AlexanderReynolds:

def define_move_phase(x):
    # Bin edges at thirds of the highest move number in this game.
    bins = (0, round(x['move_number'].max() * 1/3), round(x['move_number'].max() * 2/3), x['move_number'].max())
    phases = ["opening", "middlegame", "endgame"]
    try:
        x.loc[:, 'phase'] = pd.cut(x['move_number'], bins, labels=phases)
    except ValueError:
        # pd.cut raises ValueError for very short games whose rounded bin
        # edges are not strictly increasing.
        x.loc[:, 'phase'] = None
    return x

df = df.groupby('game_id').apply(define_move_phase)
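
If the per-group apply is still slow on a very large frame, one further variant (my own sketch, not from the accepted answer) computes the per-game maximum with groupby.transform and does the binning in a single vectorized pass, so there is no Python-level loop over games at all:

import numpy as np

# Assumes df has 'game_id' and 'move_number' as in the question.
max_move = df.groupby('game_id')['move_number'].transform('max')
ratio = df['move_number'] / max_move

# Same thirds-based split, applied to the whole column at once:
# the first matching condition wins, anything above 2/3 falls to the default.
df['phase'] = np.select(
    [ratio <= 1 / 3, ratio <= 2 / 3],
    ['opening', 'middlegame'],
    default='endgame',
)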
