I am working with the dataframe below. Each row is a move in a chess game; I want to group the rows by game and apply a function to each game based on the number of moves played in it.
         game_id  move_number colour  avg_centi
0       03gDhPWr            1  white        NaN
1       03gDhPWr            2  black       37.0
2       03gDhPWr            3  white       61.0
3       03gDhPWr            4  black       -5.0
4       03gDhPWr            5  white       26.0
5       03gDhPWr            6  black       31.0
6       03gDhPWr            7  white       -2.0
...          ...          ...    ...        ...
110091  zzaiRa7s           34  black        NaN
110092  zzaiRa7s           35  white        NaN
110093  zzaiRa7s           36  black        NaN
110094  zzaiRa7s           37  white        NaN
110095  zzaiRa7s           38  black        NaN
110096  zzaiRa7s           39  white        NaN
110097  zzaiRa7s           40  black        NaN
Specifically, I am using pd.cut to create a new column, phase, which records whether a given move was played in the opening, middlegame, or endgame.
     game_id  move_number colour  avg_centi    phase
0   03gDhPWr            1  white        NaN  opening
1   03gDhPWr            2  black       37.0  opening
2   03gDhPWr            3  white       61.0  opening
3   03gDhPWr            4  black       -5.0  opening
4   03gDhPWr            5  white       26.0  opening
5   03gDhPWr            6  black       31.0  opening
6   03gDhPWr            7  white       -2.0  opening
..       ...          ...    ...        ...      ...
54  03gDhPWr           55  white       58.0  endgame
55  03gDhPWr           56  black       26.0  endgame
56  03gDhPWr           57  white      116.0  endgame
57  03gDhPWr           58  black     2000.0  endgame
58  03gDhPWr           59  white        0.0  endgame
59  03gDhPWr           60  black        0.0  endgame
60  03gDhPWr           61  white        NaN  endgame
I'm using the following code to achieve this. Note that each game must be partitioned into opening, middlegame, and endgame bins based on the total number of moves played in that game.
for game_id, group in df.groupby('game_id'):
    bins = (0, round(group['move_number'].max() * 1/3), round(group['move_number'].max() * 2/3),
            group['move_number'].max())
    phases = ["opening", "middlegame", "endgame"]
    try:
        group.loc[:, 'phase'] = pd.cut(group['move_number'], bins, labels=phases)
    except:
        group.loc[:, 'phase'] = None
    print(group)
The problem is that iterating over every game, with thousands of games, takes forever.
I think there must be a faster way to calculate this than using a for loop to iterate through the groups and perform the calculation one by one.
Here is a method I came up with, using a simple example.
To summarize, there are 3 steps:
1. Find the max move number of each game using groupby.
2. Merge the max move number back onto the dataframe.
3. Assign the phase from move number / max move number.
My method is in test1(), while yours is in test2():
import pandas
import random
import time

# Build sample data: 25 groups ("games") of 900 to 1000 rows each
a = []
for group in range(25):
    for count in range(random.randint(900, 1000)):
        a.append({'group': chr(65 + group), 'count': count})

def test1(x):
    b = pandas.DataFrame(x)
    # Step 1: max count per group
    max_df = b.groupby(by='group', as_index=False)['count'].max().rename(columns={'count': 'max'})
    # Step 2: merge the max back onto every row
    b = pandas.merge(b, max_df, on='group', how='left')
    # Step 3: label each row by its position relative to the group's max
    b['phase'] = 'opening'
    b.loc[b['count'] > b['max'] / 3.0, 'phase'] = 'middlegame'
    b.loc[b['count'] > b['max'] / 1.5, 'phase'] = 'endgame'
    b.drop('max', axis=1, inplace=True)
    return b

def test2(x):
    df = pandas.DataFrame(x)
    df['phase'] = ''
    for game_id, group in df.groupby('group'):
        bins = (0, round(group['count'].max() * 1 / 3), round(group['count'].max() * 2 / 3),
                group['count'].max())
        phases = ["opening", "middlegame", "endgame"]
        try:
            group.loc[:, 'phase'] = pandas.cut(group['count'], bins, labels=phases)
        except:
            group.loc[:, 'phase'] = None
    return df

start_time = time.time()
out1 = test1(a)
print(time.time() - start_time)

start_time = time.time()
out2 = test2(a)
print(time.time() - start_time)

# Note: this check fails as written, because test2() never writes 'phase' back to df
assert out1.to_dict() == out2.to_dict()
test1 is a lot faster than test2, though this is only one run:
test1: 0.09799647331237793
test2: 0.769993782043457
Also, test2() seems to have an issue: it doesn't actually modify the dataframe, so the phase column stays empty. I'm not sure whether it worked for you.
Here is a try at using apply:
def split_by_third(game):
    # Fraction of the game elapsed at each move
    game_length = len(game)
    game = game.assign(phase_num=game['move_number']/game_length)
    return game

def assign_phase(row):
    # if/elif/else so that values exactly at 0.34 or 0.66 still get a label
    if row['phase_num'] < 0.34:
        return 'Beginning'
    elif row['phase_num'] < 0.66:
        return 'Middle'
    else:
        return 'End'

df_grouped = df.groupby('game_id').apply(split_by_third)
df_grouped['phase'] = df_grouped.apply(assign_phase, axis=1)
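The same ratio idea can be sketched without calling a Python function per row. This is a minimal sketch, assuming a small made-up df with the question's game_id and move_number columns: groupby(...).transform('max') broadcasts each game's last move number onto its rows, and a single pd.cut on the ratio assigns the phase. Cutting at exactly 1/3 and 2/3 is an assumption, and it differs slightly from the rounded bins in the question.

```python
import pandas as pd

# Made-up sample: one 6-move game and one 3-move game
df = pd.DataFrame({
    "game_id": ["a"] * 6 + ["b"] * 3,
    "move_number": [1, 2, 3, 4, 5, 6, 1, 2, 3],
})

# Broadcast each game's final move number to all of its rows
max_move = df.groupby("game_id")["move_number"].transform("max")

# Fraction of the game elapsed at each move, in (0, 1]
progress = df["move_number"] / max_move

# Right-closed bins: (0, 1/3], (1/3, 2/3], (2/3, 1]
df["phase"] = pd.cut(
    progress,
    bins=[0, 1/3, 2/3, 1],
    labels=["opening", "middlegame", "endgame"],
)

print(df)
```

Because the bins are fixed fractions, no per-game loop or try/except is needed; a one-move game simply falls into the last bin.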
I was able to get it to work with cleaner and faster code using groupby.apply, as suggested by @AlexanderReynolds:
def define_move_phase(x):
    bins = (0, round(x['move_number'].max() * 1/3), round(x['move_number'].max() * 2/3), x['move_number'].max())
    phases = ["opening", "middlegame", "endgame"]
    try:
        x.loc[:, 'phase'] = pd.cut(x['move_number'], bins, labels=phases)
    except ValueError:
        x.loc[:, 'phase'] = None
    return x

# apply returns a new dataframe, so keep the result
df = df.groupby('game_id').apply(define_move_phase)
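The except ValueError branch matters for very short games. This is a minimal sketch, assuming a hypothetical one-move game: the rounded bin edges collapse to (0, 0, 1, 1), and pd.cut rejects the duplicate edges with a ValueError, which is why such games end up with phase set to None.

```python
import pandas as pd

# Hypothetical one-move game, made up for illustration
moves = pd.Series([1], name="move_number")
last = moves.max()

# Same bin construction as in define_move_phase
bins = (0, round(last * 1/3), round(last * 2/3), last)

try:
    pd.cut(moves, bins, labels=["opening", "middlegame", "endgame"])
    raised = False
except ValueError:
    # Duplicate bin edges are rejected by pd.cut
    raised = True

print(bins, raised)
```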