I have a pandas series whose index contains several repeated items and I use drop_duplicates
to have its index available for further slicing on other series/dataframes:
In[1]: test
Out[1]:
5575 21010210
5575 21010210
5577 21010210
5577 21010210
5577 21010210
5583 21010210
5583 21010210
5583 21010210
5586 21010210
5586 21010210
5586 21010210
8545 21010210
8545 21010210
8718 21000102
8718 21000102
8721 21000102
8721 21000102
Name: CC, dtype: object
When I apply test.drop_duplicates(), I would expect all existing indices to remain, albeit without repetition. For some reason, pandas is not recognising some of those indices as duplicates and simply purges them from the series:
In[2]: test.drop_duplicates()
Out[2]:
5575 21010210
8718 21000102
Name: CC, dtype: object
Curiously, if I reset the index before, the drop_duplicates method will work correctly:
In[3]: test.reset_index().drop_duplicates()
Out[3]:
index CC
0 5575 21010210
2 5577 21010210
5 5583 21010210
8 5586 21010210
11 8545 21010210
13 8718 21000102
15 8721 21000102
Any reason why pandas simply removes some of the indices in this operation? How do I effectively drop those duplicates without resetting the index?
So here's your pandas Series object:
import pandas as pd
data = [
21010210, 21010210, 21010210, 21010210, 21010210, 21010210,
21010210, 21010210, 21010210, 21010210, 21010210, 21010210,
21010210, 21000102, 21000102, 21000102, 21000102
]
idx = [
5575, 5575, 5577, 5577, 5577, 5583, 5583, 5583,
5586, 5586, 5586, 8545, 8545, 8718, 8718, 8721, 8721
]
series = pd.Series(data, index=idx).rename("CC")
print(series)
>>>
5575 21010210
5575 21010210
5577 21010210
5577 21010210
5577 21010210
5583 21010210
5583 21010210
5583 21010210
5586 21010210
5586 21010210
5586 21010210
8545 21010210
8545 21010210
8718 21000102
8718 21000102
8721 21000102
8721 21000102
Name: CC, dtype: int64
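Now, if you run drop_duplicates(), it ignores your index and compares only the values. The documentation for the DataFrame version of the method says:
Return DataFrame with duplicate rows removed, optionally only considering certain columns. Indexes, including time indexes are ignored
Your series holds only two distinct values, so only two rows survive: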
print(series.drop_duplicates())
5575 21010210
8718 21000102
Name: CC, dtype: int64
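To make the value-only comparison concrete, here is a minimal sketch on a made-up three-row series (the values and labels are illustrative, not taken from the question). The two rows holding 10 have different index labels, yet the second one is still dropped:

```python
import pandas as pd

# Hypothetical toy data: the labels are all different, but two values repeat.
s = pd.Series([10, 10, 20], index=[1, 2, 3], name="CC")

# drop_duplicates() looks at the values only, so label 2 is removed
# even though it is not a duplicated index label.
deduped = s.drop_duplicates()
print(deduped)
```

The row at label 2 disappears for the same reason 5577, 5583, 5586 and 8545 disappear above: their values were already seen under an earlier label.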
Finally, reset_index() will return a DataFrame where the previous index is inserted as a column and the index is reset:
print(series.reset_index())
index CC
0 5575 21010210
1 5575 21010210
2 5577 21010210
3 5577 21010210
4 5577 21010210
5 5583 21010210
6 5583 21010210
7 5583 21010210
8 5586 21010210
9 5586 21010210
10 5586 21010210
11 8545 21010210
12 8545 21010210
13 8718 21000102
14 8718 21000102
15 8721 21000102
16 8721 21000102
Reset the index of the DataFrame, and use the default one instead.
This means that drop_duplicates() will now consider both columns, index and CC, so a row is dropped only when both match an earlier row.
print(series.reset_index().drop_duplicates())
index CC
0 5575 21010210
2 5577 21010210
5 5583 21010210
8 5586 21010210
11 8545 21010210
13 8718 21000102
15 8721 21000102
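If you want to stay with this reset-based approach but end up with the original labels again, one possible round trip is sketched below. It uses a shortened copy of the data, and it assumes the series index is unnamed, so that reset_index() calls the new column "index". Note that this removes repeated (index, value) pairs, which is not quite the same as removing repeated index labels.

```python
import pandas as pd

# Shortened copy of the data from the question, for illustration only.
series = pd.Series(
    [21010210, 21010210, 21010210, 21000102, 21000102],
    index=[5575, 5575, 5577, 8718, 8718],
    name="CC",
)

deduped = (
    series.reset_index()        # old index becomes a column named "index"
    .drop_duplicates()          # compares (index, CC) pairs
    .set_index("index")["CC"]   # put the original labels back
    .rename_axis(None)          # drop the leftover index name
)
print(deduped)
```

If two rows share a label but carry different values, both are kept here, so this only matches index-based de-duplication when each label always maps to the same value.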
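The most effective way to do it without resetting the index is to filter on the index itself; index.duplicated() marks every repeat of a label after its first occurrence: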
print(series.loc[~series.index.duplicated()])
5575 21010210
5577 21010210
5583 21010210
5586 21010210
8545 21010210
8718 21000102
8721 21000102
Name: CC, dtype: int64
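Which row survives for each label only matters when the values under one label differ. As a rough sketch on made-up data, the keep argument of Index.duplicated() controls this, and groupby(level=0).first() is an alternative that gives the same result as keep="first" here (it sorts by label by default):

```python
import pandas as pd

# Hypothetical toy data where repeated labels carry different values.
s = pd.Series(["a", "b", "c", "d"], index=[1, 1, 2, 2])

# Keep the first row seen for each label (the default behaviour).
first = s[~s.index.duplicated(keep="first")]

# Keep the last row seen for each label instead.
last = s[~s.index.duplicated(keep="last")]

# Alternative: group on the index level and take the first entry per group.
grouped = s.groupby(level=0).first()

print(first, last, grouped, sep="\n\n")
```

In the series from the question every label maps to a single value, so all three variants would return the same seven rows.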