How to drop unique rows in a pandas dataframe?
I am stuck with a seemingly easy problem: dropping unique rows in a pandas dataframe. Basically, the opposite of drop_duplicates().
Let's say this is my data:
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A
I would like to drop the rows where the (A, B) combination is unique, i.e. I would like to keep only rows 1 and 2.
I tried the following:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates()
duplicates = df[~df.index.isin(uniques.index)]
But I only get row 2, as indexes 0, 1, and 3 are all in uniques!
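To make the failure mode concrete, this is roughly what the intermediate frames look like: with the default keep='first', drop_duplicates keeps one representative of every (A, B) pair, so uniques ends up with indexes 0, 1 and 3 and the index filter only removes those:
uniques = df[['A', 'B']].drop_duplicates()   # default keep='first'
print (uniques.index.tolist())               # [0, 1, 3]
duplicates = df[~df.index.isin(uniques.index)]
print (duplicates)
A B C
2 foo 1 B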
Solutions for selecting all duplicated rows:
You can use duplicated with subset and the parameter keep=False to select all duplicates:
df = df[df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
1 foo 1 A
2 foo 1 B
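If you want to see what keep=False is doing, you can print the boolean mask on its own (starting again from the original df); rows 1 and 2 share the (foo, 1) key, so both are marked True:
mask = df.duplicated(subset=['A','B'], keep=False)
print (mask)
0    False
1     True
2     True
3    False
dtype: bool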
Solution with transform:
df = df[df.groupby(['A', 'B'])['A'].transform('size') > 1]
print (df)
A B C
1 foo 1 A
2 foo 1 B
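The intermediate here is a Series of group sizes that transform('size') broadcasts back onto the original index, which is what the > 1 comparison filters on:
sizes = df.groupby(['A', 'B'])['A'].transform('size')
print (sizes)
0    1
1    2
2    2
3    1
Name: A, dtype: int64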
Slightly modified solutions for selecting all unique rows:
#invert boolean mask by ~
df = df[~df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
0 foo 0 A
3 bar 1 A
df = df[df.groupby(['A', 'B'])['A'].transform('size') == 1]
print (df)
A B C
0 foo 0 A
3 bar 1 A
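Because the two masks are exact complements, you can also split the frame into both subsets in one pass, for example:
mask = df.duplicated(subset=['A','B'], keep=False)
dupes, uniques = df[mask], df[~mask]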
I came up with a solution using groupby:
grouped = df.groupby(['A', 'B']).size().reset_index().rename(columns={0: 'count'})
uniques = grouped[grouped['count'] == 1]
# filter on the (A, B) keys rather than the index: grouped has a fresh RangeIndex after reset_index
duplicates = df[~df.set_index(['A', 'B']).index.isin(uniques.set_index(['A', 'B']).index)]
Duplicates now has the proper result:
A B C
1 foo 1 A
2 foo 1 B
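An equivalent variant of the same idea (not from the answers above) uses GroupBy.filter, which reads nicely but runs a Python callback per group, so it can be slower on large frames:
duplicates = df.groupby(['A', 'B']).filter(lambda g: len(g) > 1)
print (duplicates)
A B C
1 foo 1 A
2 foo 1 B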
Also, my original attempt in the question can be fixed by simply adding keep=False in the drop_duplicates method:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates(keep=False)
duplicates = df[~df.index.isin(uniques.index)]
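For completeness, with keep=False the uniques frame now only contains indexes 0 and 3, so the index filter leaves exactly the rows asked for in the question:
print (duplicates)
A B C
1 foo 1 A
2 foo 1 B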
Please refer to @jezrael's answer; I think it is the safest(?), as I am relying on pandas indexes here.
# rows whose (A, B) pair occurs exactly once
df1 = df.drop_duplicates(['A', 'B'], keep=False)
# append them to the original and drop full-row duplicates:
# the unique rows now appear twice and vanish, leaving only the duplicated keys
df1 = pd.concat([df, df1])
df1 = df1.drop_duplicates(keep=False)
This technique is more suitable when you have two datasets dfX and dfY with millions of records. You may first concatenate dfX and dfY and follow the same steps.
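As a sketch of that two-dataset case (dfX and dfY are placeholder frames assumed to share the same columns, and assumed to contain no identical full-row duplicates of their own):
combined = pd.concat([dfX, dfY], ignore_index=True)
# rows whose (A, B) key occurs exactly once across both datasets
singletons = combined.drop_duplicates(['A', 'B'], keep=False)
# append them and drop full-row duplicates: the singletons vanish, the duplicated keys remain
result = pd.concat([combined, singletons]).drop_duplicates(keep=False)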