How to drop duplicates based on two or more subset criteria in a Pandas DataFrame

Let's say this is my DataFrame:

import pandas as pd

df = pd.DataFrame({ 'bio' : ['1', '1', '1', '4'],
                'center' : ['one', 'one', 'two', 'three'],
                'outcome' : ['f','t','f','f'] })

It looks like this:

  bio center outcome
0   1    one       f
1   1    one       t
2   1    two       f
3   4  three       f

I want to drop row 1 because it has the same bio and center as row 0. I want to keep row 2 because it has the same bio but a different center than row 0.

Something like this won't work, given the input structure drop_duplicates expects, but it's what I am trying to do:

df.drop_duplicates(subset = 'bio' & subset = 'center' )

Any suggestions?

Edit: changed df a bit to fit the example in the correct answer.

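Your syntax is wrong. To treat rows as duplicates when they share the same bio and center, pass both column labels as a list:

df.drop_duplicates(subset=['bio', 'center'])

Note that a plain call with no subset compares every column, so on this data it would keep row 1, because its outcome differs from row 0:

df.drop_duplicates()

The subset call returns the following: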

  bio center outcome
0   1    one       f
2   1    two       f
3   4  three       f

Take a look at the df.drop_duplicates documentation for syntax details. subset should be a sequence of column labels.
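As a minimal sketch of how subset behaves, the snippet below rebuilds the DataFrame from the question and deduplicates on bio and center only. The variable names deduped and deduped_last are illustrative and not from the original posts; keep='last' is shown only to contrast with the default keep='first'.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'bio': ['1', '1', '1', '4'],
    'center': ['one', 'one', 'two', 'three'],
    'outcome': ['f', 't', 'f', 'f'],
})

# Rows count as duplicates only when they match on every column in `subset`.
# By default the first row of each duplicate group is kept.
deduped = df.drop_duplicates(subset=['bio', 'center'])
print(deduped)

# keep='last' retains the last row of each duplicate group instead.
deduped_last = df.drop_duplicates(subset=['bio', 'center'], keep='last')
print(deduped_last)

# With no subset, all columns are compared; rows 0 and 1 differ in
# 'outcome', so nothing is dropped on this data.
all_columns = df.drop_duplicates()
print(len(all_columns))
```

Which row survives matters when the non-subset columns differ, as outcome does here, so it is worth choosing keep deliberately.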

The previous answer was very helpful. I also needed to add something to the code to get what I wanted, so I am adding it here.

The DataFrame:

  bio center outcome
0   1    one       f
1   1    one       t
2   1    two       f
3   4  three       f

After applying drop_duplicates:

  bio center outcome
0   1    one       f
2   1    two       f
3   4  three       f

Notice the index: it is no longer consecutive. If anyone wants to get back the normal index, i.e. 0, 1, 2 instead of 0, 2, 3:

df.drop_duplicates(subset=['bio', 'center'], ignore_index=True)

Output:

  bio center outcome
0   1    one       f
1   1    two       f
2   4  three       f
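For reference, here is a small sketch comparing two ways to get a consecutive index after deduplicating. It assumes pandas 1.0 or later, where drop_duplicates accepts ignore_index; on older versions, chaining reset_index(drop=True) should give the same result. The names renumbered and same_result are illustrative.

```python
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'bio': ['1', '1', '1', '4'],
    'center': ['one', 'one', 'two', 'three'],
    'outcome': ['f', 't', 'f', 'f'],
})

# Option 1: let drop_duplicates relabel the surviving rows 0..n-1.
renumbered = df.drop_duplicates(subset=['bio', 'center'], ignore_index=True)

# Option 2: drop duplicates first, then discard the old index.
same_result = (
    df.drop_duplicates(subset=['bio', 'center'])
      .reset_index(drop=True)
)

print(renumbered)
print(renumbered.equals(same_result))
```

Both calls return a new DataFrame and leave df unchanged, so assign the result if you want to keep it.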
