How to drop unique rows in a pandas dataframe?

I am stuck with a seemingly easy problem: dropping unique rows in a pandas dataframe. Basically, the opposite of drop_duplicates().

Let's say this is my data:

    A       B   C  
0   foo     0   A
1   foo     1   A
2   foo     1   B
3   bar     1   A

I would like to drop the rows where the (A, B) combination is unique, i.e. I would like to keep only rows 1 and 2.

I tried the following:

import pandas as pd

# Load dataframe
df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})

uniques = df[['A', 'B']].drop_duplicates()
duplicates = df[~df.index.isin(uniques.index)]

But I only get row 2, since index labels 0, 1, and 3 all end up in uniques!
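Printing the intermediate frame makes the problem obvious: drop_duplicates() with the default keep='first' retains the first occurrence of every (A, B) pair, so three of the four index labels survive:

print(uniques)
     A  B
0  foo  0
1  foo  1
3  bar  1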

Solutions for selecting all duplicated rows:

You can use duplicated with subset and the parameter keep=False to select all duplicates:

df = df[df.duplicated(subset=['A','B'], keep=False)]
print (df)
     A  B  C
1  foo  1  A
2  foo  1  B
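For reference, the intermediate boolean mask marks every member of a duplicated (A, B) group, including its first occurrence:

print(df.duplicated(subset=['A','B'], keep=False))
0    False
1     True
2     True
3    False
dtype: bool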

Solution with transform:

df = df[df.groupby(['A', 'B'])['A'].transform('size') > 1]
print (df)
     A  B  C
1  foo  1  A
2  foo  1  B
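Here transform('size') broadcasts each group's size back onto the original rows, so the comparison yields a mask aligned with df:

print(df.groupby(['A', 'B'])['A'].transform('size'))
0    1
1    2
2    2
3    1
Name: A, dtype: int64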

Slightly modified solutions for selecting all unique rows:

# invert the boolean mask with ~
df = df[~df.duplicated(subset=['A','B'], keep=False)]
print (df)
     A  B  C
0  foo  0  A
3  bar  1  A

df = df[df.groupby(['A', 'B'])['A'].transform('size') == 1]
print (df)
     A  B  C
0  foo  0  A
3  bar  1  A

I came up with a solution using groupby. Note that my first version compared the index of the grouped frame against df's index, which is wrong: after reset_index() the grouped frame has a fresh 0..n-1 range index, so its labels only line up with df's by coincidence. Merging the counts back onto df avoids that:

grouped = df.groupby(['A', 'B']).size().reset_index(name='count')

# how='left' preserves the row order of df, so positions line up
merged = df.merge(grouped, on=['A', 'B'], how='left')
duplicates = merged[merged['count'] > 1].drop(columns='count')

duplicates now holds the proper result:

     A  B  C
1  foo  1  A
2  foo  1  B

Also, my original attempt in the question can be fixed by simply adding keep=False to the drop_duplicates call:

# Load Dataframe
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})

uniques = df[['A', 'B']].drop_duplicates(keep=False)
duplicates = df[~df.index.isin(uniques.index)]
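With keep=False, uniques retains only index labels 0 and 3 (the truly unique (A, B) pairs), so the inverted isin filter now returns the expected rows:

print(duplicates)
     A  B  C
1  foo  1  A
2  foo  1  B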

Still, please prefer @jezrael's answer; I think it is safer, as my approach relies on pandas indexes here.

# Keep only the rows whose (A, B) pair is unique
df1 = df.drop_duplicates(['A', 'B'], keep=False)

# Append those unique rows to df: they now appear twice,
# so dropping full-row duplicates leaves only the duplicated-key rows
df1 = pd.concat([df, df1])
df1 = df1.drop_duplicates(keep=False)

This technique is especially useful when you have two datasets dfX and dfY with millions of records: first concatenate dfX and dfY, then follow the same steps, as sketched below.
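A minimal sketch of that idea, using two small hypothetical frames dfX and dfY in place of the large datasets (note that the final drop_duplicates(keep=False) compares all columns, so this assumes neither frame contains fully identical rows):

dfX = pd.DataFrame({"A": ["foo", "bar"], "B": [0, 1], "C": ["A", "A"]})
dfY = pd.DataFrame({"A": ["foo", "baz"], "B": [0, 2], "C": ["B", "A"]})

combined = pd.concat([dfX, dfY], ignore_index=True)
# Rows whose (A, B) pair appears only once across both frames
uniques = combined.drop_duplicates(['A', 'B'], keep=False)
# Append the unique rows and drop full duplicates: only duplicated keys remain
dup_rows = pd.concat([combined, uniques]).drop_duplicates(keep=False)
print(dup_rows)
     A  B  C
0  foo  0  A
2  foo  0  B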
