How to drop unique rows in a pandas dataframe?
I am stuck with a seemingly easy problem: dropping unique rows in a pandas dataframe. Basically, the opposite of drop_duplicates().
Let's say this is my data:
A B C
0 foo 0 A
1 foo 1 A
2 foo 1 B
3 bar 1 A
I would like to drop the rows where the (A, B) combination is unique, i.e. I would like to keep only rows 1 and 2.
I tried the following:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates()
duplicates = df[~df.index.isin(uniques.index)]
But I only get row 2, as indexes 0, 1, and 3 are all in uniques!
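To make the failure mode concrete, this is roughly what the intermediate frames look like: with the default keep='first', drop_duplicates keeps one representative of every (A, B) pair, so uniques ends up with indexes 0, 1 and 3 and the index filter only removes those:
uniques = df[['A', 'B']].drop_duplicates()   # default keep='first'
print (uniques.index.tolist())               # [0, 1, 3]
duplicates = df[~df.index.isin(uniques.index)]
print (duplicates)
A B C
2 foo 1 B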
Solutions for selecting all duplicated rows:
You can use duplicated with subset and the parameter keep=False to select all duplicates:
df = df[df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
1 foo 1 A
2 foo 1 B
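If you want to see what keep=False is doing, you can print the boolean mask on its own (starting again from the original df); rows 1 and 2 share the (foo, 1) key, so both are marked True:
mask = df.duplicated(subset=['A','B'], keep=False)
print (mask)
0    False
1     True
2     True
3    False
dtype: bool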
Solution with transform:
df = df[df.groupby(['A', 'B'])['A'].transform('size') > 1]
print (df)
A B C
1 foo 1 A
2 foo 1 B
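The intermediate here is a Series of group sizes that transform('size') broadcasts back onto the original index, which is what the > 1 comparison filters on:
sizes = df.groupby(['A', 'B'])['A'].transform('size')
print (sizes)
0    1
1    2
2    2
3    1
Name: A, dtype: int64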
Slightly modified solutions for selecting all unique rows:
#invert boolean mask by ~
df = df[~df.duplicated(subset=['A','B'], keep=False)]
print (df)
A B C
0 foo 0 A
3 bar 1 A
df = df[df.groupby(['A', 'B'])['A'].transform('size') == 1]
print (df)
A B C
0 foo 0 A
3 bar 1 A
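Because the two masks are exact complements, you can also split the frame into both subsets in one pass, for example:
mask = df.duplicated(subset=['A','B'], keep=False)
dupes, uniques = df[mask], df[~mask]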
I came up with a solution using groupby:
grouped = df.groupby(['A', 'B']).size().reset_index().rename(columns={0: 'count'})
uniques = grouped[grouped['count'] == 1]
# filter on the (A, B) keys rather than the index: grouped has a fresh RangeIndex after reset_index
duplicates = df[~df.set_index(['A', 'B']).index.isin(uniques.set_index(['A', 'B']).index)]
Duplicates now has the proper result:
A B C
1 foo 1 A
2 foo 1 B
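An equivalent variant of the same idea (not from the answers above) uses GroupBy.filter, which reads nicely but runs a Python callback per group, so it can be slower on large frames:
duplicates = df.groupby(['A', 'B']).filter(lambda g: len(g) > 1)
print (duplicates)
A B C
1 foo 1 A
2 foo 1 B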
Also, my original attempt in the question can be fixed by simply adding keep=False in the drop_duplicates method:
# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
uniques = df[['A', 'B']].drop_duplicates(keep=False)
duplicates = df[~df.index.isin(uniques.index)]
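For completeness, with keep=False the uniques frame now only contains indexes 0 and 3, so the index filter leaves exactly the rows asked for in the question:
print (duplicates)
A B C
1 foo 1 A
2 foo 1 B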
Please refer to @jezrael's answer; I think it is the safest(?), as I am relying on pandas indexes here.
# rows whose (A, B) pair occurs exactly once
df1 = df.drop_duplicates(['A', 'B'], keep=False)
# append them to the original and drop full-row duplicates:
# the unique rows now appear twice and vanish, leaving only the duplicated keys
df1 = pd.concat([df, df1])
df1 = df1.drop_duplicates(keep=False)
This technique is more suitable when you have two datasets dfX and dfY with millions of records. You may first concatenate dfX and dfY and follow the same steps.
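As a sketch of that two-dataset case (dfX and dfY are placeholder frames assumed to share the same columns, and assumed to contain no identical full-row duplicates of their own):
combined = pd.concat([dfX, dfY], ignore_index=True)
# rows whose (A, B) key occurs exactly once across both datasets
singletons = combined.drop_duplicates(['A', 'B'], keep=False)
# append them and drop full-row duplicates: the singletons vanish, the duplicated keys remain
result = pd.concat([combined, singletons]).drop_duplicates(keep=False)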