I have a pandas dataframe with NBA player stats, and I want to drop the rows of duplicate players. There are duplicates because some players played on multiple teams during the 2020-2021 season. For each of these players there is also a row with the player's combined stats across all teams and a team label of 'TOT', which indicates that the player played on 2 or more teams that season. When I drop duplicate players, I want the 'TOT' row to remain and all the other duplicates to be gone. I'm unsure how to specify that I want to drop all duplicates but keep the one where df['Team'] == 'TOT'.
Here is what my dataframe looks like: [image: Dataframe]
In this example, I want to drop the duplicates of the player 'Jarrett Allen', but keep the row for Jarrett Allen where his team (Tm) is 'TOT'.
One way is to use a helper column. For example with the following df,
    player  stats team
0      bob      1  ABC
1    alice      2  DEF
2  charlie      3  GHI
3     mary      4  JKL
4     mary      5  MNO
5     mary      6  TOT
6      bob      7  TOT
7      bob      8  VWX
Create a column whose value is True if the 'team' value is 'TOT' and False otherwise:
import numpy as np
df['multiple_teams'] = np.where(df['team'] == 'TOT', True, False)
Shown sorted by player, the df now looks like this:
    player  stats team  multiple_teams
1    alice      2  DEF           False
0      bob      1  ABC           False
6      bob      7  TOT            True
7      bob      8  VWX           False
2  charlie      3  GHI           False
3     mary      4  JKL           False
4     mary      5  MNO           False
5     mary      6  TOT            True
Now we can use the keep parameter of the drop_duplicates() function to decide what to keep. In this case we can achieve the desired result by dropping duplicates on the subset of player and multiple_teams with keep=False. This removes every row whose (player, multiple_teams) pair occurs more than once, so for a player with several team rows only the 'TOT' row survives. The result:
    player  stats team  multiple_teams
1    alice      2  DEF           False
6      bob      7  TOT            True
2  charlie      3  GHI           False
5     mary      6  TOT            True
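The drop_duplicates() call itself is not shown above, so here is a minimal end-to-end sketch of the helper-column idea on the same sample df. The variable name result is only illustrative, and the boolean column is built with .eq() instead of np.where, which is an equivalent substitution.

```python
import pandas as pd

# Same sample data as in the answer above
df = pd.DataFrame({
    "player": ["bob", "alice", "charlie", "mary", "mary", "mary", "bob", "bob"],
    "stats": [1, 2, 3, 4, 5, 6, 7, 8],
    "team": ["ABC", "DEF", "GHI", "JKL", "MNO", "TOT", "TOT", "VWX"],
})

# Helper column: True only for the combined 'TOT' row
df["multiple_teams"] = df["team"].eq("TOT")

# keep=False drops every row whose (player, multiple_teams) pair is repeated,
# i.e. the individual team rows of players with two or more teams
result = df.drop_duplicates(subset=["player", "multiple_teams"], keep=False)
print(result.sort_values("player"))
```

Note that this relies on every multi-team player having at least two individual team rows. If a player had exactly one team row plus a 'TOT' row, neither pair would be repeated and both rows would be kept.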
You can just filter out unnecessary rows:
df = df.loc[~df['Rk'].duplicated(keep=False) | (df['Tm'] == 'TOT'), :]
It can be understood this way: From my dataframe take all rows which are not duplicated in column 'Rk' or rows which have 'TOT' in column 'Tm'.
":" at the end means that you want to take all columns.
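As a rough illustration of that filter, here is a small self-contained sketch. The rows are made-up placeholder data, and it assumes that 'Rk' holds one value per player that is repeated on each of that player's rows, as the answer implies.

```python
import pandas as pd

# Made-up rows for illustration; 'Rk' is assumed to be shared by all rows of one player
df = pd.DataFrame({
    "Rk": [1, 2, 2, 2, 3],
    "Player": ["Player A", "Jarrett Allen", "Jarrett Allen", "Jarrett Allen", "Player B"],
    "Tm": ["AAA", "TOT", "BBB", "CCC", "DDD"],
})

# Keep rows whose 'Rk' appears only once, plus every 'TOT' row
mask = ~df["Rk"].duplicated(keep=False) | df["Tm"].eq("TOT")
result = df.loc[mask]
print(result)
```

Because the mask keeps every 'TOT' row, a player with more than one 'TOT' row would still appear more than once in the result.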
You can sort the DataFrame using the key argument, such that 'TOT' is sorted to the bottom, and then drop_duplicates, keeping the last.
This guarantees that in the end there is only a single row per player, even if the data are messy and may have multiple 'TOT' rows for a single player, one team and one 'TOT' row, or multiple teams and multiple 'TOT' rows.
Starting from this sample data:
print(df)
#    player  stats team
#0    alice      2  DEF
#1      bob      7  TOT
#2      bob      1  ABC
#3  charlie      3  GHI
#4     mary      4  JKL
#5     mary      5  MNO
#6     mary      6  TOT
df = (df.sort_values('team', key=lambda x: x.eq('TOT'))
.drop_duplicates('player', keep='last'))
print(df)
#    player  stats team
#0    alice      2  DEF
#3  charlie      3  GHI
#1      bob      7  TOT
#6     mary      6  TOT
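To check the "messy data" claim, here is a small sketch on a hypothetical frame in which one player has two 'TOT' rows and another has a single team row plus a 'TOT' row. The names messy and result and the data are illustrative. kind='mergesort' is added as an assumption so that tied rows keep their original order, since the default sort is not guaranteed to be stable.

```python
import pandas as pd

# Hypothetical messy data: 'ann' has two 'TOT' rows, 'ben' has one team row plus 'TOT'
messy = pd.DataFrame({
    "player": ["ann", "ann", "ann", "ben", "ben", "cat"],
    "stats": [1, 2, 3, 4, 5, 6],
    "team": ["TOT", "ABC", "TOT", "DEF", "TOT", "GHI"],
})

# The key maps 'team' to booleans, so non-'TOT' rows (False) sort before 'TOT' rows (True)
result = (messy.sort_values("team", key=lambda s: s.eq("TOT"), kind="mergesort")
               .drop_duplicates("player", keep="last"))
print(result)
```

Each player ends up with exactly one row, and that row is a 'TOT' row whenever the player has one.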