Let's say this is my DataFrame:
import pandas as pd

df = pd.DataFrame({'bio': ['1', '1', '1', '4'],
                   'center': ['one', 'one', 'two', 'three'],
                   'outcome': ['f', 't', 'f', 'f']})
It looks like this...
  bio center outcome
0   1    one       f
1   1    one       t
2   1    two       f
3   4  three       f
I want to drop row 1 because it has the same bio and center as row 0. I want to keep row 2 because it has the same bio but a different center than row 0.
Something like this won't work given how drop_duplicates takes its arguments, but it's what I'm trying to do:
df.drop_duplicates(subset = 'bio' & subset = 'center' )
Any suggestions?
Edit: changed df a bit to fit the example in the accepted answer.
Your syntax is wrong. Here's the correct way:
df.drop_duplicates(subset=['bio', 'center'])
Note that calling df.drop_duplicates() with no arguments compares every column, so on this data it would keep row 1, whose outcome differs from row 0.
The call with subset returns the following:
  bio center outcome
0   1    one       f
2   1    two       f
3   4  three       f
Take a look at the df.drop_duplicates documentation for syntax details. subset should be a column label or a sequence of column labels.
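As a minimal, runnable sketch using the DataFrame from the question, the following shows how the choice of subset (and the keep parameter, which the answer does not mention) changes which rows survive. The variable names are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'bio': ['1', '1', '1', '4'],
    'center': ['one', 'one', 'two', 'three'],
    'outcome': ['f', 't', 'f', 'f'],
})

# Rows are duplicates only when both 'bio' and 'center' match;
# the first row of each group is kept by default (keep='first').
by_pair = df.drop_duplicates(subset=['bio', 'center'])

# Without subset, all columns are compared, so no row is dropped here.
by_all = df.drop_duplicates()

# keep='last' retains the last row of each duplicate group instead.
keep_last = df.drop_duplicates(subset=['bio', 'center'], keep='last')

print(by_pair)
print(keep_last)
```

drop_duplicates returns a new DataFrame, so assign the result back to df if you want to keep working with the deduplicated data.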
The previous answer was very helpful. I needed one more thing to get the result I wanted, so I'm adding it here.
The DataFrame:
  bio center outcome
0   1    one       f
1   1    one       t
2   1    two       f
3   4  three       f
After applying drop_duplicates:
  bio center outcome
0   1    one       f
2   1    two       f
3   4  three       f
Notice the index: it is now 0, 2, 3, because the surviving rows keep their original labels. If you want a fresh 0, 1, 2 index instead, pass ignore_index=True (available since pandas 1.0):
df.drop_duplicates(subset=['bio', 'center'], ignore_index=True)
Output:
  bio center outcome
0   1    one       f
1   1    two       f
2   4  three       f
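If you are on a pandas version older than 1.0, where ignore_index is not available, an equivalent approach is to chain reset_index(drop=True). This is a sketch on the same example data; the name deduped is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'bio': ['1', '1', '1', '4'],
    'center': ['one', 'one', 'two', 'three'],
    'outcome': ['f', 't', 'f', 'f'],
})

# Drop duplicates on the two key columns, then renumber the index.
# drop=True discards the old index instead of adding it as a column.
deduped = (
    df.drop_duplicates(subset=['bio', 'center'])
      .reset_index(drop=True)
)

print(deduped)
```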