How can I add new columns efficiently using groupby in Pandas?

I am using nba_py to get the scoreboard data for some NBA matches.

Below is an example of how the data are structured:

    SEASON  GAME_DATE_EST        GAME_SEQUENCE  GAME_ID   HOME_TEAM_ID  VISITOR_TEAM_ID  WINNER
0   2013    2013-10-05T00:00:00  1              11300001  12321         1610612760       V
1   2013    2013-10-05T00:00:00  2              11300002  1610612754    1610612741       V
2   2013    2013-10-05T00:00:00  3              11300003  1610612745    1610612740       V
3   2013    2013-10-05T00:00:00  4              11300004  1610612747    1610612744       H
4   2013    2013-10-06T00:00:00  1              11300005  12324         1610612755       V

You can find a part of the data here: NBA Games Data.

My aim is to add the following columns to the original data, each counting the games played earlier in the same season:

For the hometeam:

   1. Total wins/losses for hometeam if hometeam plays at home ("HOMETEAM_HOME_WINS"/"HOMETEAM_HOME_LOSSES")
   2. Total wins/losses for hometeam if hometeam is visiting ("HOMETEAM_VISITOR_WINS"/"HOMETEAM_VISITOR_LOSSES")

For the visitor_team:

   3. Total wins/losses for visitor_team if visitor_team plays at home ("VISITOR_TEAM_HOME_WINS"/"VISITOR_TEAM_HOME_LOSSES")
   4. Total wins/losses for visitor_team if visitor_team is visiting ("VISITOR_TEAM_VISITOR_WINS"/"VISITOR_TEAM_VISITOR_LOSSES")

My first simplistic approach is below:

def get_home_team_home_wins(x):
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name
    season_hometeam_games = grouped_seasons_hometeams.get_group((season, hometeam))
    # Keep only the games played before the current one
    home_games = season_hometeam_games[(season_hometeam_games.index < gid)]

    if not home_games.empty:
        try:
            home_wins = home_games.WINNER.value_counts()["H"]
        except KeyError:
            # No home win among the earlier games
            home_wins = 0
    else:
        home_wins = 0

    return home_wins

grouped_seasons_hometeams = df.groupby(["SEASON", "HOME_TEAM_ID"])

df["HOMETEAM_HOME_WINS"] = df.apply(get_home_team_home_wins, axis=1)

Another approach is iterating over the rows:

grouped_seasons = df.groupby("SEASON")
df["HOMETEAM_HOME_WINS"] = 0

current_season = 0
for index, row in df.iterrows():
    season = row.SEASON
    if season != current_season:
        current_season = season
        season_games = grouped_seasons.get_group(current_season)

    hometeam = row.HOME_TEAM_ID
    gid = row.name
    games = season_games[(season_games.index < gid)]
    home_games = games[(games.HOME_TEAM_ID == hometeam)]

    if not home_games.empty:
        try:
            home_wins = home_games.WINNER.value_counts()["H"]
        except KeyError:
            home_wins = 0
    else:
        home_wins = 0

    # Write the count back into the column created above
    df.at[index, "HOMETEAM_HOME_WINS"] = home_wins

Update 1:

Below are functions that return the wins/losses of the home team when it plays at home and when it visits. The functions for the visitor_team would be analogous.

def get_home_team_home_wins_losses(x):
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name

    games = df[(df.SEASON == season) & (df.index < gid)]
    home_team_home_games = games[(games.HOME_TEAM_ID == hometeam)]  


    # HOMETEAM plays at home
    if not home_team_home_games.empty:
        home_team_home_games_value_counts = home_team_home_games.WINNER.value_counts()

        try:
            home_team_home_wins = home_team_home_games_value_counts["H"]
        except Exception as e:
            home_team_home_wins = 0

        try:
            home_team_home_losses = home_team_home_games_value_counts["V"]
        except Exception as e:
            home_team_home_losses = 0
    else:
        home_team_home_wins = 0
        home_team_home_losses = 0

    return [home_team_home_wins, home_team_home_losses]

def get_home_team_visitor_wins_losses(x):
    hometeam = x.HOME_TEAM_ID
    season = x.SEASON
    gid = x.name

    games = df[(df.SEASON == season) & (df.index < gid)]
    home_team_visitor_games = games[(games.VISITOR_TEAM_ID == hometeam)]

    # HOMETEAM visits
    if not home_team_visitor_games.empty:
        home_team_visitor_games_value_counts = home_team_visitor_games.WINNER.value_counts()

        try:
            home_team_visitor_wins = home_team_visitor_games_value_counts["V"]
        except Exception as e:
            home_team_visitor_wins = 0

        try:
            home_team_visitor_losses = home_team_visitor_games_value_counts["H"]
        except Exception as e:
            home_team_visitor_losses = 0
    else:
        home_team_visitor_wins = 0
        home_team_visitor_losses = 0    

    return [home_team_visitor_wins, home_team_visitor_losses]

df["HOME_TEAM_HOME_WINS"], df["HOME_TEAM_HOME_LOSSES"] = zip(*df.apply(lambda x: get_home_team_home_wins_losses(x), axis=1))
df["HOME_TEAM_VISITOR_WINS"], df["HOME_TEAM_VISITOR_LOSSES"] = zip(*df.apply(lambda x: get_home_team_visitor_wins_losses(x), axis=1))
df["HOME_TEAM_WINS"] = df["HOME_TEAM_HOME_WINS"] + df["HOME_TEAM_VISITOR_WINS"]
df["HOME_TEAM_LOSSES"] = df["HOME_TEAM_HOME_LOSSES"] + df["HOME_TEAM_VISITOR_LOSSES"]

The above approaches are not efficient, so I am thinking of using groupby, but it is not clear to me how.

I will add updates whenever I find something more efficient.

Any ideas? Thanks.

Consider using transform(), but first conditionally create integer HOMEWINNER and VISITWINNER columns. The commented-out lines are easier-to-read equivalents of the if/else calculations using numpy.where(), in case you have NumPy available.

Note that transform() retains all rows but aggregates by the IDs, so every record of a particular HOME_TEAM_ID and SEASON repeats the same value in these aggregate columns:

nbadf['VISITWINNER'] = [1 if x == 'V' else 0 for x in nbadf['WINNER']]
#nbadf['VISITWINNER'] = np.where(nbadf['WINNER']=='V', 1, 0)

nbadf['HOMEWINNER'] = [1 if x == 'H' else 0 for x in nbadf['WINNER']]
#nbadf['HOMEWINNER'] = np.where(nbadf['WINNER']=='H', 1, 0)

# Home record: a home win is 'H', a home loss is 'V'
nbadf['HOME_TEAM_WINS'] = (nbadf.groupby(['HOME_TEAM_ID','SEASON'])
                                ['HOMEWINNER'].transform('sum'))
nbadf['HOME_TEAM_LOSSES'] = (nbadf.groupby(['HOME_TEAM_ID','SEASON'])
                                  ['VISITWINNER'].transform('sum'))

# Visiting record: a visiting win is 'V', a visiting loss is 'H'
nbadf['VISIT_TEAM_WINS'] = (nbadf.groupby(['VISITOR_TEAM_ID','SEASON'])
                                 ['VISITWINNER'].transform('sum'))
nbadf['VISIT_TEAM_LOSSES'] = (nbadf.groupby(['VISITOR_TEAM_ID','SEASON'])
                                   ['HOMEWINNER'].transform('sum'))

nbadf.drop(['HOMEWINNER', 'VISITWINNER'], inplace=True, axis=1)

#   SEASON  ...  WINNER  HOME_TEAM_WINS  HOME_TEAM_LOSSES  VISIT_TEAM_WINS  VISIT_TEAM_LOSSES
#0    2013  ...      V               0                 1                1                  0
#1    2013  ...      V               0                 1                1                  0
#2    2013  ...      V               0                 1                1                  0
#3    2013  ...      H               1                 0                0                  1
#4    2013  ...      V               0                 1                1                  0
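
The transform() totals above cover the whole season. If what is needed instead is the running count before each game (which is what the question's `index < gid` filter computes), one possible sketch is a grouped cumsum(). It assumes the rows are already in chronological order, and the toy frame and its values are made up for illustration:

```python
import pandas as pd

# Toy data shaped like the question's scoreboard (values are invented).
df = pd.DataFrame({
    "SEASON": [2013, 2013, 2013, 2013, 2014],
    "HOME_TEAM_ID": [1, 1, 2, 1, 1],
    "VISITOR_TEAM_ID": [2, 3, 1, 2, 2],
    "WINNER": ["H", "V", "V", "H", "H"],
})

# 1 where the home side won / lost the game, 0 otherwise.
home_win = (df["WINNER"] == "H").astype(int)
home_loss = (df["WINNER"] == "V").astype(int)

# Group by season and home team, mirroring the question's groupby keys.
keys = [df["SEASON"], df["HOME_TEAM_ID"]]

# cumsum() includes the current game, so subtracting the game's own value
# leaves only the games played earlier in the same season.
df["HOME_TEAM_HOME_WINS"] = home_win.groupby(keys).cumsum() - home_win
df["HOME_TEAM_HOME_LOSSES"] = home_loss.groupby(keys).cumsum() - home_loss

print(df)
```

The same pattern with VISITOR_TEAM_ID as the second key would give the visitor-side running counts.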

Now, for home teams that later play as visitors and vice versa, consider merging subsetted data frames on the team IDs (change the column positions if needed). This captures home teams that are also visitor teams. Then run the aggregates above on mergedf, recalculating HOMEWINNER from WINNER_x and VISITWINNER from WINNER_y:

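# MERGES HOME SUBSET DF AND VISITOR SUBSET DF (columns selected by position)
mergedf = pd.merge(nbadf.iloc[:, [0,1,2,3,4,6]], nbadf.iloc[:, [0,1,2,3,5,6]],
                   left_on=['HOME_TEAM_ID'], right_on=['VISITOR_TEAM_ID'], how='inner')

# RECREATES THE INDICATOR COLUMNS FROM THE SUFFIXED WINNER COLUMNS
mergedf['HOMEWINNER'] = [1 if x == 'H' else 0 for x in mergedf['WINNER_x']]
mergedf['VISITWINNER'] = [1 if x == 'V' else 0 for x in mergedf['WINNER_y']]

mergedf['HOMETEAM_AS_VISITOR_WINS'] = (mergedf.groupby(['VISITOR_TEAM_ID','SEASON_y'])
                                              ['VISITWINNER'].transform('sum'))

mergedf['VISITORTEAM_AS_HOME_WINS'] = (mergedf.groupby(['HOME_TEAM_ID','SEASON_x'])
                                              ['HOMEWINNER'].transform('sum'))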
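
One thing to keep in mind is that the inner merge above is many-to-many: each home game of a team is paired with every visiting game of that same team, so mergedf has more rows than nbadf. A possible alternative, sketched here with made-up toy data, is to aggregate the visiting record once per season and team and then attach it to the home side with a many-to-one merge. The names `away` and `out` are only illustrative:

```python
import pandas as pd

# Toy data shaped like the question's scoreboard (values are invented).
nbadf = pd.DataFrame({
    "SEASON": [2013, 2013, 2013, 2013],
    "HOME_TEAM_ID": [1, 2, 1, 3],
    "VISITOR_TEAM_ID": [2, 1, 3, 1],
    "WINNER": ["H", "V", "V", "H"],
})

# One row per (season, team): wins of that team in its visiting games.
# The team column is renamed so it can be matched against HOME_TEAM_ID.
away = (
    nbadf.assign(VISITWINNER=(nbadf["WINNER"] == "V").astype(int))
         .groupby(["SEASON", "VISITOR_TEAM_ID"], as_index=False)["VISITWINNER"]
         .sum()
         .rename(columns={"VISITOR_TEAM_ID": "HOME_TEAM_ID",
                          "VISITWINNER": "HOMETEAM_AS_VISITOR_WINS"})
)

# Many-to-one left merge: every game keeps exactly one row.
out = nbadf.merge(away, on=["SEASON", "HOME_TEAM_ID"], how="left")

# Home teams with no visiting games in that season get 0 instead of NaN.
out["HOMETEAM_AS_VISITOR_WINS"] = (
    out["HOMETEAM_AS_VISITOR_WINS"].fillna(0).astype(int)
)

print(out)
```

Like the transform() version, this yields season totals rather than running counts.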
