How to drop duplicates in pandas dataframe but keep row based on specific column value

I have a pandas dataframe with NBA player stats, and I want to drop the rows of duplicate players. There are duplicates because some players played on multiple teams during the 2020-2021 season. However, for each player who played on multiple teams, there is also a row with that player's combined stats across all teams and a team label of 'TOT', which indicates that the player played on 2 or more teams that season. When I drop duplicate players, I want the row with the team of 'TOT' to remain and all the other duplicates to be gone. I'm unsure how to specify that I want to drop all duplicates but keep the duplicate where df['Team'] == 'TOT'.

Here is what my dataframe looks like: [image: Dataframe]

In this example, I want to drop the duplicates of the player 'Jarrett Allen', but keep the row for Jarrett Allen where his team (Tm) is 'TOT'.

One way is to use a helper column. For example, with the following df:

    player  stats team
0      bob      1  ABC
1    alice      2  DEF
2  charlie      3  GHI
3     mary      4  JKL
4     mary      5  MNO
5     mary      6  TOT
6      bob      7  TOT
7      bob      8  VWX

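Create a column whose value is True if the 'team' value is 'TOT' and False otherwise. The output below is shown sorted by player:

import numpy as np

# Flag the combined-stats rows: True where team is 'TOT', False otherwise
df['multiple_teams'] = np.where(df['team'] == 'TOT', True, False)
df = df.sort_values('player', kind='stable')  # group each player's rows together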

    player  stats team  multiple_teams
1    alice      2  DEF           False
0      bob      1  ABC           False
6      bob      7  TOT            True
7      bob      8  VWX           False
2  charlie      3  GHI           False
3     mary      4  JKL           False
4     mary      5  MNO           False
5     mary      6  TOT            True

Now we can use the keep parameter of the drop_duplicates() function to decide what to keep. In this case we can achieve the desired result by dropping duplicates on the subset of player and multiple_teams with keep=False. This means that every row whose combination of player and multiple_teams occurs more than once is removed from the df, so the per-team rows of a player who changed teams are dropped and that player's single 'TOT' row remains. Resulting in:

    player  stats team  multiple_teams
1    alice      2  DEF           False
6      bob      7  TOT            True
2  charlie      3  GHI           False
5     mary      6  TOT            True
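
The steps above can be put together as a minimal, self-contained sketch. The data is the sample frame from this answer; the variable name `result` and the use of `Series.eq` in place of `np.where` are choices made here for illustration, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the example above
df = pd.DataFrame({
    'player': ['bob', 'alice', 'charlie', 'mary', 'mary', 'mary', 'bob', 'bob'],
    'stats': [1, 2, 3, 4, 5, 6, 7, 8],
    'team': ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'TOT', 'TOT', 'VWX'],
})

# Helper column: True for the combined-stats row, False for per-team rows
df['multiple_teams'] = df['team'].eq('TOT')

# keep=False drops every row whose (player, multiple_teams) pair repeats,
# so a traded player's per-team rows go and the single 'TOT' row stays
result = (df.drop_duplicates(subset=['player', 'multiple_teams'], keep=False)
            .sort_values('player'))

print(result)
```

Note that this relies on each traded player having at least two per-team rows. A player with exactly one team row plus a 'TOT' row would keep both, because neither (player, multiple_teams) pair repeats.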

You can just filter out unnecessary rows:

df = df.loc[~df['Rk'].duplicated(keep=False) | (df['Tm'] == 'TOT'), :]

It can be understood this way: from my dataframe, take all rows which are not duplicated in column 'Rk', or rows which have 'TOT' in column 'Tm'.

":" at the end means that you want to take all columns.
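
As a rough sketch of how this filter behaves, here is a tiny frame that uses the question's column names. It assumes that all rows of the same player share one 'Rk' value; the 'Rk' numbers and the placeholder names 'Player A' and 'Player B' are made up for illustration, as are the names `is_traded` and `result`.

```python
import pandas as pd

# Illustrative rows only: the Rk values and 'Player A'/'Player B' are made up
df = pd.DataFrame({
    'Rk': [1, 2, 2, 2, 3],
    'Player': ['Player A', 'Jarrett Allen', 'Jarrett Allen',
               'Jarrett Allen', 'Player B'],
    'Tm': ['MIA', 'TOT', 'BRK', 'CLE', 'NOP'],
})

# True for every row whose Rk appears more than once (players with several rows)
is_traded = df['Rk'].duplicated(keep=False)

# Keep single-row players, plus the 'TOT' row of each multi-row player
result = df.loc[~is_traded | (df['Tm'] == 'TOT'), :]

print(result)
```

Passing keep=False to duplicated() is what marks every occurrence as a duplicate, including the first one; with the default keep='first', the first row of each traded player would slip through the filter.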

You can sort the DataFrame using the key argument of sort_values (available since pandas 1.1), such that 'TOT' is sorted to the bottom, and then call drop_duplicates, keeping the last row.

This guarantees that in the end there is only a single row per player, even if the data are messy and a single player has multiple 'TOT' rows, one team and one 'TOT' row, or multiple teams and multiple 'TOT' rows.

df = (df.sort_values('team', key=lambda x: x.eq('TOT'))
        .drop_duplicates('player', keep='last'))

For example, starting from this sample DataFrame:

print(df)
#    player  stats team
#0    alice      2  DEF
#1      bob      7  TOT
#2      bob      1  ABC
#3  charlie      3  GHI
#4     mary      4  JKL
#5     mary      5  MNO
#6     mary      6  TOT

df = (df.sort_values('team', key=lambda x: x.eq('TOT'))
        .drop_duplicates('player', keep='last'))

print(df)
#    player  stats team
#0    alice      2  DEF
#3  charlie      3  GHI
#1      bob      7  TOT
#6     mary      6  TOT
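
To illustrate the single-row-per-player guarantee on messy input, here is a small sketch. The frame `messy` and its values are hypothetical, and `kind='stable'` is an addition made here so that the row order within each group is deterministic; it is not part of the original answer.

```python
import pandas as pd

# Hypothetical messy input: 'bob' has two 'TOT' rows, 'mary' has one team
# plus 'TOT', and 'alice' has a single team row
messy = pd.DataFrame({
    'player': ['bob', 'bob', 'bob', 'mary', 'mary', 'alice'],
    'stats': [1, 7, 7, 4, 4, 2],
    'team': ['ABC', 'TOT', 'TOT', 'JKL', 'TOT', 'DEF'],
})

# Booleans sort False before True, so 'TOT' rows move to the bottom;
# a stable sort keeps the original relative order within each group
result = (messy.sort_values('team', key=lambda s: s.eq('TOT'), kind='stable')
               .drop_duplicates('player', keep='last'))

print(result)
```

Each player ends up with exactly one row: the 'TOT' row when one exists, otherwise the player's only team row.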
