I am using nba_py to get the scoreboard data for some NBA matches.
Below is an example of how the data are structured:
   SEASON        GAME_DATE_EST  GAME_SEQUENCE   GAME_ID  HOME_TEAM_ID  VISITOR_TEAM_ID  WINNER
0    2013  2013-10-05T00:00:00              1  11300001         12321       1610612760       V
1    2013  2013-10-05T00:00:00              2  11300002    1610612754       1610612741       V
2    2013  2013-10-05T00:00:00              3  11300003    1610612745       1610612740       V
3    2013  2013-10-05T00:00:00              4  11300004    1610612747       1610612744       H
4    2013  2013-10-06T00:00:00              1  11300005         12324       1610612755       V
Part of the data is available here: NBA Games Data.
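To try the snippets without downloading anything, the five sample rows above can be typed in as a small frame. This is only a sketch: the name df is chosen to match the code further down, and the values are copied from the sample as shown.

```python
import pandas as pd

# The five sample rows shown above, entered by hand.
df = pd.DataFrame({
    "SEASON": [2013, 2013, 2013, 2013, 2013],
    "GAME_DATE_EST": [
        "2013-10-05T00:00:00", "2013-10-05T00:00:00", "2013-10-05T00:00:00",
        "2013-10-05T00:00:00", "2013-10-06T00:00:00",
    ],
    "GAME_SEQUENCE": [1, 2, 3, 4, 1],
    "GAME_ID": [11300001, 11300002, 11300003, 11300004, 11300005],
    "HOME_TEAM_ID": [12321, 1610612754, 1610612745, 1610612747, 12324],
    "VISITOR_TEAM_ID": [1610612760, 1610612741, 1610612740, 1610612744, 1610612755],
    "WINNER": ["V", "V", "V", "H", "V"],  # "H" = home team won, "V" = visitor won
})
print(df)
```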
My aim is to add the following columns to the original data, counting only games played earlier in the same season:
For the hometeam:
1. Total wins/losses for hometeam if hometeam plays at home ("HOMETEAM_HOME_WINS"/"HOMETEAM_HOME_LOSSES")
2. Total wins/losses for hometeam if hometeam is visiting ("HOMETEAM_VISITOR_WINS"/"HOMETEAM_VISITOR_LOSSES")
For the visitor_team:
3. Total wins/losses for visitor_team if visitor_team plays at home ("VISITOR_TEAM_HOME_WINS"/"VISITOR_TEAM_HOME_LOSSES")
4. Total wins/losses for visitor_team if visitor_team is visiting ("VISITOR_TEAM_VISITOR_WINS"/"VISITOR_TEAM_VISITOR_LOSSES")
My first simplistic approach is below:
def get_home_team_home_wins(x):
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name
    season_hometeam_games = grouped_seasons_hometeams.get_group((season, hometeam))
    # Only games that appear earlier in the frame (smaller index) count.
    home_games = season_hometeam_games[(season_hometeam_games.index < gid)]
    if not home_games.empty:
        try:
            # WINNER holds "H" or "V", as in the sample above.
            home_wins = home_games.WINNER.value_counts()["H"]
        except KeyError:
            home_wins = 0
    else:
        home_wins = 0
    return home_wins

grouped_seasons_hometeams = df.groupby(["SEASON", "HOME_TEAM_ID"])
df["HOMETEAM_HOME_WINS"] = df.apply(lambda x: get_home_team_home_wins(x), axis=1)
Another approach is iterating over the rows:
grouped_seasons = df.groupby("SEASON")
df["HOMETEAM_HOME_WINS"] = 0
current_season = 0
for index, row in df.iterrows():
    season = row.SEASON
    # Fetch the season's games only when the season changes.
    if season != current_season:
        current_season = season
        season_games = grouped_seasons.get_group(current_season)
    hometeam = row.HOME_TEAM_ID
    gid = row.name
    games = season_games[(season_games.index < gid)]
    home_games = games[(games.HOME_TEAM_ID == hometeam)]
    if not home_games.empty:
        try:
            home_wins = home_games.WINNER.value_counts()["H"]
        except KeyError:
            home_wins = 0
    else:
        home_wins = 0
    # Write straight into the frame; .ix was removed in pandas 1.0.
    df.at[index, "HOMETEAM_HOME_WINS"] = home_wins
Update 1:
Below are functions that return the home team's wins/losses when it played at home and when it visited. Similar functions would be needed for the visitor_team.
def get_home_team_home_wins_losses(x):
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name
    games = df[(df.SEASON == season) & (df.index < gid)]
    home_team_home_games = games[(games.HOME_TEAM_ID == hometeam)]
    # HOMETEAM plays at home
    if not home_team_home_games.empty:
        home_team_home_games_value_counts = home_team_home_games.WINNER.value_counts()
        try:
            home_team_home_wins = home_team_home_games_value_counts["H"]
        except KeyError:
            home_team_home_wins = 0
        try:
            home_team_home_losses = home_team_home_games_value_counts["V"]
        except KeyError:
            home_team_home_losses = 0
    else:
        home_team_home_wins = 0
        home_team_home_losses = 0
    return [home_team_home_wins, home_team_home_losses]

def get_home_team_visitor_wins_losses(x):
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name
    games = df[(df.SEASON == season) & (df.index < gid)]
    home_team_visitor_games = games[(games.VISITOR_TEAM_ID == hometeam)]
    # HOMETEAM visits
    if not home_team_visitor_games.empty:
        home_team_visitor_games_value_counts = home_team_visitor_games.WINNER.value_counts()
        try:
            home_team_visitor_wins = home_team_visitor_games_value_counts["V"]
        except KeyError:
            home_team_visitor_wins = 0
        try:
            home_team_visitor_losses = home_team_visitor_games_value_counts["H"]
        except KeyError:
            home_team_visitor_losses = 0
    else:
        home_team_visitor_wins = 0
        home_team_visitor_losses = 0
    return [home_team_visitor_wins, home_team_visitor_losses]

df["HOME_TEAM_HOME_WINS"], df["HOME_TEAM_HOME_LOSSES"] = zip(*df.apply(lambda x: get_home_team_home_wins_losses(x), axis=1))
df["HOME_TEAM_VISITOR_WINS"], df["HOME_TEAM_VISITOR_LOSSES"] = zip(*df.apply(lambda x: get_home_team_visitor_wins_losses(x), axis=1))
df["HOME_TEAM_WINS"] = df["HOME_TEAM_HOME_WINS"] + df["HOME_TEAM_VISITOR_WINS"]
df["HOME_TEAM_LOSSES"] = df["HOME_TEAM_HOME_LOSSES"] + df["HOME_TEAM_VISITOR_LOSSES"]
The above approaches are not efficient, so I am thinking of using groupby, but it is not clear to me how.
I will add updates whenever I find something more efficient.
Any ideas? Thanks.
Consider using transform(), but first conditionally create HOMEWINNER and VISITWINNER integer columns. The commented-out lines are easier-to-read equivalents of the if/else calculations using numpy.where(), which you may or may not have available as a package.
Do note that transform() retains all rows but aggregates by the IDs, so every record of a particular HOME_TEAM_ID repeats the same values in these aggregate columns:
nbadf['VISITWINNER'] = [1 if x == 'V' else 0 for x in nbadf['WINNER']]
#nbadf['VISITWINNER'] = np.where(nbadf['WINNER']=='V', 1, 0)
nbadf['HOMEWINNER'] = [1 if x == 'H' else 0 for x in nbadf['WINNER']]
#nbadf['HOMEWINNER'] = np.where(nbadf['WINNER']=='H', 1, 0)

nbadf['HOME_TEAM_WINS'] = nbadf.groupby(['HOME_TEAM_ID','SEASON'])\
    ['HOMEWINNER'].transform(sum)
nbadf['HOME_TEAM_LOSSES'] = nbadf.groupby(['HOME_TEAM_ID','SEASON'])\
    ['VISITWINNER'].transform(sum)
nbadf['VISIT_TEAM_WINS'] = nbadf.groupby(['VISITOR_TEAM_ID','SEASON'])\
    ['VISITWINNER'].transform(sum)
nbadf['VISIT_TEAM_LOSSES'] = nbadf.groupby(['VISITOR_TEAM_ID','SEASON'])\
    ['HOMEWINNER'].transform(sum)

nbadf.drop(['HOMEWINNER', 'VISITWINNER'], inplace=True, axis=1)

# SEASON ... WINNER HOME_TEAM_WINS HOME_TEAM_LOSSES VISIT_TEAM_WINS VISIT_TEAM_LOSSES
#0 2013 ... V 0 1 1 0
#1 2013 ... V 0 1 1 0
#2 2013 ... V 0 1 1 0
#3 2013 ... H 1 0 0 1
#4 2013 ... V 0 1 1 0
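Note that transform(sum) gives whole-season totals repeated on every row. If the record before each game is wanted instead (as the index < gid filter in the question suggests), one possible sketch swaps in a grouped cumsum() and subtracts the current game. It assumes the rows are already in chronological order; the toy games frame and its team IDs are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: team 1 hosts three games in one season.
games = pd.DataFrame({
    "SEASON": [2013, 2013, 2013, 2013],
    "HOME_TEAM_ID": [1, 2, 1, 1],
    "VISITOR_TEAM_ID": [2, 3, 3, 2],
    "WINNER": ["H", "V", "V", "H"],
})

# 0/1 flags from the home team's point of view.
home_won = (games["WINNER"] == "H").astype(int)
home_lost = (games["WINNER"] == "V").astype(int)

keys = [games["SEASON"], games["HOME_TEAM_ID"]]
# cumsum() includes the current game, so subtract it to count earlier games only.
games["HOME_TEAM_HOME_WINS"] = home_won.groupby(keys).cumsum() - home_won
games["HOME_TEAM_HOME_LOSSES"] = home_lost.groupby(keys).cumsum() - home_lost

print(games)
```

The same pattern with VISITOR_TEAM_ID as the key and the flags swapped would give the visitor team's record as a visitor.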
Now, for instances of home teams later visiting and vice versa, consider a merge on the IDs with subsetted data frames (change the column numbers if needed). This captures home teams who are also visitor teams. Then run the above aggregates on mergedf, calculating the same conditional columns: HOMEWINNER, this time from WINNER_x, and VISITWINNER from WINNER_y:
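# MERGES HOME SUBSET DF AND VISITOR SUBSET DF
# (.iloc selects columns by position; nbadf[[0, 1, ...]] would look for integer labels)
mergedf = pd.merge(nbadf.iloc[:, [0,1,2,3,4,6]], nbadf.iloc[:, [0,1,2,3,5,6]],
                   left_on=['HOME_TEAM_ID'], right_on=['VISITOR_TEAM_ID'], how='inner')

# WINNER_x comes from the home subset, WINNER_y from the visitor subset
mergedf['HOMEWINNER'] = [1 if x == 'H' else 0 for x in mergedf['WINNER_x']]
mergedf['VISITWINNER'] = [1 if x == 'V' else 0 for x in mergedf['WINNER_y']]

mergedf['HOMETEAM_AS_VISITOR_WINS'] = mergedf.groupby(['VISITOR_TEAM_ID','SEASON_y'])\
    ['VISITWINNER'].transform(sum)
mergedf['VISITORTEAM_AS_HOME_WINS'] = mergedf.groupby(['HOME_TEAM_ID','SEASON_x'])\
    ['HOMEWINNER'].transform(sum)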
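Because each team appears many times on both sides, this merge pairs every home game with every visitor game of the same team, so the sums may count games more than once. A different technique, sketched below under the assumption that GAME_ID increases chronologically, reshapes the data to one row per team per game, takes running totals per team and season, and joins them back. The toy frame, the helper team_rows and the name long_df are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: two teams alternating venues.
games = pd.DataFrame({
    "SEASON": [2013] * 4,
    "GAME_ID": [1, 2, 3, 4],
    "HOME_TEAM_ID": [10, 20, 10, 20],
    "VISITOR_TEAM_ID": [20, 10, 20, 10],
    "WINNER": ["H", "V", "V", "H"],
})

def team_rows(team_col, role, win_code):
    """One row per game for the team in `team_col`, with 0/1 result flags."""
    won = (games["WINNER"] == win_code).astype(int)
    out = games[["SEASON", "GAME_ID"]].copy()
    out["TEAM_ID"] = games[team_col]
    out["ROLE"] = role
    for venue in ("HOME", "VISITOR"):
        played_here = int(venue == role)  # 1 only for the venue this team played at
        out[f"{venue}_WINS"] = won * played_here
        out[f"{venue}_LOSSES"] = (1 - won) * played_here
    return out

long_df = pd.concat(
    [team_rows("HOME_TEAM_ID", "HOME", "H"),
     team_rows("VISITOR_TEAM_ID", "VISITOR", "V")],
    ignore_index=True,
).sort_values("GAME_ID", kind="stable").reset_index(drop=True)

stat_cols = ["HOME_WINS", "HOME_LOSSES", "VISITOR_WINS", "VISITOR_LOSSES"]
# Running totals per team and season, minus the current game,
# give each team's record going into that game.
running = long_df.groupby(["SEASON", "TEAM_ID"])[stat_cols].cumsum()
long_df[stat_cols] = running - long_df[stat_cols]

# Attach the home team's and the visitor team's records to each game.
result = games.copy()
for role, prefix in (("HOME", "HOME_TEAM_"), ("VISITOR", "VISITOR_TEAM_")):
    side = long_df.loc[long_df["ROLE"] == role, ["GAME_ID"] + stat_cols]
    side = side.rename(columns={c: prefix + c for c in stat_cols})
    result = result.merge(side, on="GAME_ID", how="left")

print(result)
```

Each game row then carries eight columns, such as HOME_TEAM_VISITOR_WINS (the home team's earlier wins as a visitor) and VISITOR_TEAM_HOME_LOSSES.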