How to drop duplicates in a pandas dataframe but keep rows based on a specific column value

Question

I have a pandas dataframe with NBA player stats, and I want to drop the rows of duplicate players. There are duplicates because some players played on multiple teams during the 2020-2021 season. However, for each of these players there is also a row with that player's combined stats across all teams and a team label of 'TOT', which indicates that the player played on 2 or more teams that season. When I drop duplicate players, I want the row with the team of 'TOT' to remain and all the other duplicates to be gone. I'm unsure how to specify that I want to drop all duplicates but keep the one where df['Team'] == 'TOT'.

Here is what my dataframe looks like: [image: Dataframe]

In this example, I want to drop the duplicates of the player 'Jarrett Allen', but keep the row for Jarrett Allen where his team (Tm) is 'TOT'.

Answer 1

One way is to use a helper column. For example, with the following df:

    player  stats team
0      bob      1  ABC
1    alice      2  DEF
2  charlie      3  GHI
3     mary      4  JKL
4     mary      5  MNO
5     mary      6  TOT
6      bob      7  TOT
7      bob      8  VWX

Creating a column where the value is True if the 'team' value is 'TOT' and False otherwise results in:

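import numpy as np

# Flag the combined 'TOT' rows, then sort by player to match the output below
df['multiple_teams'] = np.where(df['team'] == 'TOT', True, False)
df = df.sort_values('player')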

    player  stats team  multiple_teams
1    alice      2  DEF           False
0      bob      1  ABC           False
6      bob      7  TOT            True
7      bob      8  VWX           False
2  charlie      3  GHI           False
3     mary      4  JKL           False
4     mary      5  MNO           False
5     mary      6  TOT            True

Now we can use the keep parameter of drop_duplicates() to decide what to keep. In this case we can achieve the desired result by dropping duplicates on the subset of player and multiple_teams with keep=False. This means that every row whose (player, multiple_teams) pair occurs more than once is removed from the df, resulting in:

    player  stats team  multiple_teams
1    alice      2  DEF           False
6      bob      7  TOT            True
2  charlie      3  GHI           False
5     mary      6  TOT            True
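
For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the example df above. The variable name `result` is my own, and the helper column is computed with a plain comparison rather than np.where, which should give the same booleans.

```python
import pandas as pd

# Rebuild the example df from this answer.
df = pd.DataFrame({
    'player': ['bob', 'alice', 'charlie', 'mary', 'mary', 'mary', 'bob', 'bob'],
    'stats': [1, 2, 3, 4, 5, 6, 7, 8],
    'team': ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'TOT', 'TOT', 'VWX'],
})

# Helper column: True on the combined 'TOT' row, False otherwise.
df['multiple_teams'] = df['team'] == 'TOT'

# keep=False drops every row whose (player, multiple_teams) pair occurs more
# than once, i.e. the per-team rows of players who changed teams.
result = df.drop_duplicates(subset=['player', 'multiple_teams'], keep=False)
print(result)
```

Note that this relies on every multi-team player having at least two per-team rows and exactly one 'TOT' row. If a player had a single team row plus a 'TOT' row, neither pair would be duplicated and both rows would survive.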

Answer 2

You can just filter out the unnecessary rows:

df = df.loc[~df['Rk'].duplicated(keep=False) | (df['Tm'] == 'TOT'), :]

It can be understood this way: from the dataframe, take all rows whose value in column 'Rk' is not duplicated, plus all rows that have 'TOT' in column 'Tm'.

The ":" at the end means that you want to take all columns.

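To see the filter on a small example, here is a sketch using made-up data. I am assuming a `player` column in place of 'Rk' and a `team` column in place of 'Tm'; the names `is_unique`, `is_total` and `filtered` are only for illustration.

```python
import pandas as pd

# Made-up sample data standing in for the real stats table.
df = pd.DataFrame({
    'player': ['bob', 'alice', 'charlie', 'mary', 'mary', 'mary', 'bob', 'bob'],
    'stats': [1, 2, 3, 4, 5, 6, 7, 8],
    'team': ['ABC', 'DEF', 'GHI', 'JKL', 'MNO', 'TOT', 'TOT', 'VWX'],
})

# True for players that appear on exactly one row.
is_unique = ~df['player'].duplicated(keep=False)
# True for the combined-season rows.
is_total = df['team'] == 'TOT'

# Keep a row if either condition holds; ':' selects all columns.
filtered = df.loc[is_unique | is_total, :]
print(filtered)
```

Splitting the mask into two named conditions is just a readability choice; it does the same thing as the one-liner.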
Answer 3

You can sort the DataFrame using the key argument of sort_values, such that 'TOT' is sorted to the bottom, and then call drop_duplicates, keeping the last row.

This guarantees that in the end there is only a single row per player, even if the data are messy: a single player may have multiple 'TOT' rows, one team row and one 'TOT' row, or multiple team rows and multiple 'TOT' rows.

# Sample input, with the 'TOT' row in a different position for each player
print(df)
#    player  stats team
#0    alice      2  DEF
#1      bob      7  TOT
#2      bob      1  ABC
#3  charlie      3  GHI
#4     mary      4  JKL
#5     mary      5  MNO
#6     mary      6  TOT

df = (df.sort_values('team', key=lambda x: x.eq('TOT'))
        .drop_duplicates('player', keep='last'))

print(df)
#    player  stats team
#0    alice      2  DEF
#3  charlie      3  GHI
#1      bob      7  TOT
#6     mary      6  TOT
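
As a check on the "messy data" claim, here is a small sketch with hypothetical input where one player has two 'TOT' rows and another has a single team row plus 'TOT'. The data and the name `deduped` are my own. I also pass kind='mergesort' so the sort is stable, and add sort_index() at the end to restore the original row order; neither is part of the answer above. The key argument needs pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical messy input: 'bob' has two 'TOT' rows,
# 'mary' has one team row plus a 'TOT' row.
df = pd.DataFrame({
    'player': ['alice', 'bob', 'bob', 'bob', 'mary', 'mary'],
    'stats': [2, 7, 1, 9, 4, 6],
    'team': ['DEF', 'TOT', 'ABC', 'TOT', 'JKL', 'TOT'],
})

deduped = (
    # Sort on a boolean key: non-'TOT' rows first, 'TOT' rows last.
    df.sort_values('team', key=lambda s: s.eq('TOT'), kind='mergesort')
      # Keep the last row per player, which is a 'TOT' row when one exists.
      .drop_duplicates('player', keep='last')
      # Optional: put the surviving rows back in their original order.
      .sort_index()
)
print(deduped)
```

With a stable sort, a player with several 'TOT' rows keeps the one that came last in the original frame. Whether that is the right row to keep depends on your data.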
