I have a question about finding duplicates in a dataframe and removing them based on a specific column. Here is what I am trying to accomplish:
Is it possible to remove duplicates but keep the first 2?
Here is an example of my current dataframe, called df; take a look at the bracket notes I have placed below to give you an idea.
Note: if 'Roll' = 1, I want to look at the Date column, check whether that Date appears a second time in the column, keep those two rows, and delete any others with that Date.
Date Open High Low Close Roll Dupes
1 19780106 236.00 237.50 234.50 235.50 0 NaN
2 19780113 235.50 239.00 235.00 238.25 0 NaN
3 19780120 238.00 239.00 234.50 237.00 0 NaN
4 19780127 237.00 238.50 235.50 236.00 1 NaN (KEEP)
5 19780203 236.00 236.00 232.25 233.50 0 NaN (KEEP)
6 19780127 237.00 238.50 235.50 236.00 0 NaN (KEEP)
7 19780203 236.00 236.00 232.25 233.50 0 NaN (DELETE)
8 19780127 237.00 238.50 235.50 236.00 0 NaN (DELETE)
9 19780203 236.00 236.00 232.25 233.50 0 NaN (DELETE)
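For reference, the frame in the table above can be reconstructed like this (a sketch; the Dupes column is all NaN in the question, so it is filled with NaN here):

```python
import pandas as pd

# Sketch reconstructing the example frame from the table above
df = pd.DataFrame({
    'Date':  [19780106, 19780113, 19780120, 19780127, 19780203,
              19780127, 19780203, 19780127, 19780203],
    'Open':  [236.00, 235.50, 238.00, 237.00, 236.00,
              237.00, 236.00, 237.00, 236.00],
    'High':  [237.50, 239.00, 239.00, 238.50, 236.00,
              238.50, 236.00, 238.50, 236.00],
    'Low':   [234.50, 235.00, 234.50, 235.50, 232.25,
              235.50, 232.25, 235.50, 232.25],
    'Close': [235.50, 238.25, 237.00, 236.00, 233.50,
              236.00, 233.50, 236.00, 233.50],
    'Roll':  [0, 0, 0, 1, 0, 0, 0, 0, 0],
    'Dupes': [float('nan')] * 9,
}, index=range(1, 10))  # row labels 1..9 as in the table
```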
This is what currently removes the dupes, BUT it removes every duplicate after the first occurrence (obviously):
df = df.drop_duplicates('Date')
EDIT: I forgot to mention something: the only duplicates I want to keep are those where column 'Roll' = 1. If it does, keep that row and the next row that matches it on column 'Date'.
Using head with a groupby keeps the first x entries in each group, which I think accomplishes what you want.
In [52]: df.groupby('Date').head(2)
Out[52]:
Date Open High Low Close Roll
1 19780106 236.0 237.5 234.50 235.50 0
2 19780113 235.5 239.0 235.00 238.25 0
3 19780120 238.0 239.0 234.50 237.00 0
4 19780127 237.0 238.5 235.50 236.00 0
5 19780203 236.0 236.0 232.25 233.50 0
6 19780127 237.0 238.5 235.50 236.00 0
7 19780203 236.0 236.0 232.25 233.50 0
Edit: to keep two rows only for dates that have a Roll = 1 row, first compute how many rows each date is allowed to keep (2 if any Roll = 1, else 1), then take that many from each group:
In [16]: df['dupe_count'] = df.groupby('Date')['Roll'].transform('max') + 1
In [17]: df.groupby('Date', as_index=False).apply(lambda x: x.head(x['dupe_count'].iloc[0]))
Out[17]:
Date Open High Low Close Roll Dupes dupe_count
0 1 19780106 236.0 237.5 234.50 235.50 0 NaN 1
1 2 19780113 235.5 239.0 235.00 238.25 0 NaN 1
2 3 19780120 238.0 239.0 234.50 237.00 0 NaN 1
3 4 19780127 237.0 238.5 235.50 236.00 1 NaN 2
6 19780127 237.0 238.5 235.50 236.00 0 NaN 2
4 5 19780203 236.0 236.0 232.25 233.50 0 NaN 1
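As a variant of the same idea, a sketch using cumcount (a hypothetical rewrite, not from the original answer) avoids the groupby apply by comparing each row's position within its Date group against that group's allowance:

```python
import pandas as pd

# Minimal frame with the two duplicated dates from the question
df = pd.DataFrame({
    'Date': [19780127, 19780203, 19780127, 19780203, 19780127, 19780203],
    'Roll': [1, 0, 0, 0, 0, 0],
})

# Allowance per Date group: 2 if any row has Roll == 1, else 1
allowance = df.groupby('Date')['Roll'].transform('max') + 1

# cumcount numbers rows 0, 1, 2, ... within each Date group,
# so keep a row only while its position is below the allowance
result = df[df.groupby('Date').cumcount() < allowance]
# keeps both 19780127 rows (that date hit Roll = 1) but only the
# first 19780203 row
```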
Assuming Roll can only take the values 0 and 1, if you do
df.groupby(['Date', 'Roll'], as_index=False).first()
you will get two rows for dates where one of the rows had Roll = 1, and only one row for dates that have only Roll = 0, which I think is what you want.
Pass as_index=False so that the group keys don't end up in the index, as discussed in your comment.
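A self-contained sketch of this approach, on a cut-down frame with the same two duplicated dates (column values assumed from the question's table):

```python
import pandas as pd

df = pd.DataFrame({
    'Date':  [19780127, 19780203, 19780127, 19780203, 19780127],
    'Roll':  [1, 0, 0, 0, 0],
    'Close': [236.00, 233.50, 236.00, 233.50, 236.00],
})

# One row per (Date, Roll) pair: a date that had a Roll == 1 row keeps
# two rows (one for Roll 0, one for Roll 1); all-zero dates collapse
# to a single row
result = df.groupby(['Date', 'Roll'], as_index=False).first()
```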