How do I keep the first two duplicates in a pandas DataFrame?

Question

I have a question about finding duplicates in a dataframe and removing them based on a specific column. Here is what I am trying to accomplish:

Is it possible to remove duplicates but keep the first 2?

Here is an example of my current dataframe, called df. The notes in brackets below show which rows I want to keep and which I want to delete.

Note: If 'Roll' = 1, then I want to look at the Date column and see if there is a second row with the same Date. If so, keep those two rows and delete any others with that Date.

    Date    Open    High     Low      Close  Roll  Dupes
1  19780106  236.00  237.50  234.50  235.50     0    NaN
2  19780113  235.50  239.00  235.00  238.25     0    NaN
3  19780120  238.00  239.00  234.50  237.00     0    NaN
4  19780127  237.00  238.50  235.50  236.00     1    NaN (KEEP)  
5  19780203  236.00  236.00  232.25  233.50     0    NaN (KEEP)
6  19780127  237.00  238.50  235.50  236.00     0    NaN (KEEP)
7  19780203  236.00  236.00  232.25  233.50     0    NaN (DELETE)
8  19780127  237.00  238.50  235.50  236.00     0    NaN (DELETE)
9  19780203  236.00  236.00  232.25  233.50     0    NaN (DELETE)

This is what I currently use to remove the dupes, but it (obviously) removes all of them:

df = df.drop_duplicates('Date')

EDIT: I forgot to mention something. The only duplicate I want to keep is when column 'Roll' = 1. If it does, keep that row and the next one that matches it based on column 'Date'.

Answer 1

Using head with a groupby keeps the first x entries in each group, which I think accomplishes what you want.

In [52]: df.groupby('Date').head(2)
Out[52]: 
       Date   Open   High     Low   Close  Roll
1  19780106  236.0  237.5  234.50  235.50     0
2  19780113  235.5  239.0  235.00  238.25     0
3  19780120  238.0  239.0  234.50  237.00     0
4  19780127  237.0  238.5  235.50  236.00     0
5  19780203  236.0  236.0  232.25  233.50     0
6  19780127  237.0  238.5  235.50  236.00     0
7  19780203  236.0  236.0  232.25  233.50     0
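For anyone who wants to reproduce this, here is a minimal self-contained sketch. The frame is rebuilt by hand from the question's sample (only the Date, Close and Roll columns, which is a simplification on my part), so treat it as an illustration and not the asker's real data.

```python
import pandas as pd

# Sample rebuilt from the question; index 1..9 mirrors the row labels shown there.
df = pd.DataFrame(
    {
        "Date": [19780106, 19780113, 19780120, 19780127, 19780203,
                 19780127, 19780203, 19780127, 19780203],
        "Close": [235.50, 238.25, 237.00, 236.00, 233.50,
                  236.00, 233.50, 236.00, 233.50],
        "Roll": [0, 0, 0, 1, 0, 0, 0, 0, 0],
    },
    index=range(1, 10),
)

# Keep at most the first two rows for every Date; original row order is preserved.
first_two = df.groupby("Date").head(2)
print(first_two.index.tolist())  # [1, 2, 3, 4, 5, 6, 7]
```

Rows 8 and 9 are dropped because they are the third occurrence of their dates. Row 7 survives, which is why this version alone does not satisfy the Roll condition from the question's edit.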

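Edit: to keep a second row only for dates where some row has Roll = 1, first compute how many rows to keep per date, then take that many from the head of each group.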

In [16]: df['dupe_count'] = df.groupby('Date')['Roll'].transform('max') + 1

In [17]: df.groupby('Date', as_index=False).apply(lambda x: x.head(x['dupe_count'].iloc[0]))
Out[17]: 
         Date   Open   High     Low   Close  Roll  Dupes  dupe_count
0 1  19780106  236.0  237.5  234.50  235.50     0    NaN           1
1 2  19780113  235.5  239.0  235.00  238.25     0    NaN           1
2 3  19780120  238.0  239.0  234.50  237.00     0    NaN           1
3 4  19780127  237.0  238.5  235.50  236.00     1    NaN           2
  6  19780127  237.0  238.5  235.50  236.00     0    NaN           2
4 5  19780203  236.0  236.0  232.25  233.50     0    NaN           1
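The same idea can also be written without apply, which avoids the extra index level in the output above. This is an alternative sketch of mine, not the code from the answer: it numbers the rows within each date using cumcount and compares that position against the per-date limit. The names position and keep_limit are just illustrative, and the sample frame is rebuilt by hand from the question.

```python
import pandas as pd

# Sample rebuilt from the question (Date, Close and Roll only).
df = pd.DataFrame(
    {
        "Date": [19780106, 19780113, 19780120, 19780127, 19780203,
                 19780127, 19780203, 19780127, 19780203],
        "Close": [235.50, 238.25, 237.00, 236.00, 233.50,
                  236.00, 233.50, 236.00, 233.50],
        "Roll": [0, 0, 0, 1, 0, 0, 0, 0, 0],
    },
    index=range(1, 10),
)

# 0-based position of each row within its Date group.
position = df.groupby("Date").cumcount()

# Rows allowed per Date: 2 if any row in the group has Roll == 1, else 1.
# Assumes Roll only takes the values 0 and 1.
keep_limit = df.groupby("Date")["Roll"].transform("max") + 1

result = df[position < keep_limit]
print(result.index.tolist())  # [1, 2, 3, 4, 5, 6]
```

Like the apply version, this keeps the first two rows of a date whenever any row for that date has Roll = 1, even if the Roll = 1 row is not the first one. If the Roll = 1 row must itself be among the kept rows, the logic would need adjusting.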

Answer 2

Assuming Roll can only take the values 0 and 1, if you do

df.groupby(['Date', 'Roll'], as_index=False).first()

you will get two rows for dates where one of the rows had Roll = 1, and only one row for dates that have only Roll = 0, which I think is what you want.
I passed as_index=False so that the group keys don't end up in the index, as discussed in your comment.
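A small runnable sketch of this approach, again on a hand-built copy of the question's sample (Date, Close and Roll only, which is my simplification):

```python
import pandas as pd

# Sample rebuilt from the question.
df = pd.DataFrame(
    {
        "Date": [19780106, 19780113, 19780120, 19780127, 19780203,
                 19780127, 19780203, 19780127, 19780203],
        "Close": [235.50, 238.25, 237.00, 236.00, 233.50,
                  236.00, 233.50, 236.00, 233.50],
        "Roll": [0, 0, 0, 1, 0, 0, 0, 0, 0],
    },
    index=range(1, 10),
)

# One row per (Date, Roll) pair; as_index=False keeps Date and Roll as columns.
result = df.groupby(["Date", "Roll"], as_index=False).first()
print(result)
```

Two things to be aware of: the result is sorted by the group keys and gets a fresh 0..n-1 index, so the original row labels are lost, and first() takes the first non-null value per column, so on data with missing values a result row may combine values from different source rows.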

Answer 1: score 2, 2015-09-11 19:47:35
Answer 2: score 1, accepted, 2015-09-11 21:28:27
