Pandas drop duplicates does not behave as expected

I have a pandas series whose index contains several repeated items and I use drop_duplicates to have its index available for further slicing on other series/dataframes:

In[1]: test
Out[1]: 
5575    21010210
5575    21010210
5577    21010210
5577    21010210
5577    21010210
5583    21010210
5583    21010210
5583    21010210
5586    21010210
5586    21010210
5586    21010210
8545    21010210
8545    21010210
8718    21000102
8718    21000102
8721    21000102
8721    21000102
Name: CC, dtype: object

When I apply test.drop_duplicates(), I would expect every distinct index label to remain, just without repetition. Instead, pandas does not seem to treat the repeated labels as duplicates and removes most of them from the Series entirely:

In[2]: test.drop_duplicates()
Out[2]: 
5575    21010210
8718    21000102
Name: CC, dtype: object

Curiously, if I reset the index first, drop_duplicates gives the result I expect:

In[3]: test.reset_index().drop_duplicates()
Out[3]: 
    index        CC
0    5575  21010210
2    5577  21010210
5    5583  21010210
8    5586  21010210
11   8545  21010210
13   8718  21000102
15   8721  21000102

Why does pandas remove some of the index labels in this operation? How can I drop those duplicates without resetting the index?

So here's your pandas Series object:

import pandas as pd

data = [
    21010210, 21010210, 21010210, 21010210, 21010210, 21010210, 
    21010210, 21010210,  21010210, 21010210, 21010210, 21010210, 
    21010210, 21000102, 21000102, 21000102, 21000102
]

idx = [
    5575, 5575, 5577, 5577, 5577, 5583, 5583, 5583, 
    5586, 5586, 5586, 8545, 8545, 8718, 8718, 8721, 8721
]

series = pd.Series(data, index=idx).rename("CC")

print(series)

>>>
5575    21010210
5575    21010210
5577    21010210
5577    21010210
5577    21010210
5583    21010210
5583    21010210
5583    21010210
5586    21010210
5586    21010210
5586    21010210
8545    21010210
8545    21010210
8718    21000102
8718    21000102
8721    21000102
8721    21000102
Name: CC, dtype: int64

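Now, if you run drop_duplicates(), the index is ignored and only the values are compared. The documentation for DataFrame.drop_duplicates describes the same behaviour:

"Return DataFrame with duplicate rows removed, optionally only considering certain columns. Indexes, including time indexes are ignored."

The Series holds only two distinct values, so only the first row for each value survives: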

print(series.drop_duplicates())

5575    21010210
8718    21000102
Name: CC, dtype: int64
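
To see why only two rows survive, it may help to compare the mask built from the values with the mask built from the index labels. The following is a minimal sketch on a shortened, hypothetical Series named s (not the original data); the variable names value_mask and index_mask are illustrative only.

```python
import pandas as pd

# Hypothetical, shortened stand-in for the Series above.
s = pd.Series(
    [21010210, 21010210, 21010210, 21000102, 21000102],
    index=[5575, 5575, 5577, 8718, 8721],
    name="CC",
)

# Series.duplicated() compares values only; index labels play no part.
value_mask = s.duplicated()

# Index.duplicated() compares index labels only; values play no part.
index_mask = s.index.duplicated()

print(value_mask.tolist())  # [False, True, True, False, True]
print(index_mask.tolist())  # [False, True, False, False, False]

# drop_duplicates() keeps the rows where the value mask is False.
print(s.drop_duplicates().index.tolist())  # [5575, 8718]
```

The two masks disagree on labels 5577 and 8721: their labels are new, but their values have already been seen, so drop_duplicates() discards them.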

Finally, reset_index() returns a DataFrame in which the previous index becomes an ordinary column (named index here) and a new default integer index is assigned:

print(series.reset_index())
    index        CC
0    5575  21010210
1    5575  21010210
2    5577  21010210
3    5577  21010210
4    5577  21010210
5    5583  21010210
6    5583  21010210
7    5583  21010210
8    5586  21010210
9    5586  21010210
10   5586  21010210
11   8545  21010210
12   8545  21010210
13   8718  21000102
14   8718  21000102
15   8721  21000102
16   8721  21000102

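From the documentation: "Reset the index of the DataFrame, and use the default one instead."

Because the old index is now an ordinary column, drop_duplicates() compares both columns (index and CC), and a row is dropped only when both match an earlier row.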

print(series.reset_index().drop_duplicates())
    index        CC
0    5575  21010210
2    5577  21010210
5    5583  21010210
8    5586  21010210
11   8545  21010210
13   8718  21000102
15   8721  21000102
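
Note that this approach removes repeated (label, value) pairs, which is not quite the same as removing repeated labels. The sketch below uses a small hypothetical Series s in which one label carries two different values; it also restores the old index with set_index, assuming the column created by reset_index() is named "index" because the original index has no name.

```python
import pandas as pd

# Hypothetical data: label 10 appears three times with two distinct values.
s = pd.Series([1, 1, 2, 3], index=[10, 10, 10, 20], name="CC")

# Deduplicate on (label, value) pairs, then move the labels back into the index.
by_pair = s.reset_index().drop_duplicates().set_index("index")["CC"]

# Deduplicate on labels only, keeping the first row for each label.
by_label = s[~s.index.duplicated()]

print(by_pair.index.tolist())   # [10, 10, 20]
print(by_pair.tolist())         # [1, 2, 3]
print(by_label.index.tolist())  # [10, 20]
print(by_label.tolist())        # [1, 3]
```

In the data from the question each label maps to a single value, so both approaches give the same rows there.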

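To drop rows with a repeated index label without resetting the index, build a boolean mask with series.index.duplicated() and keep only the rows where it is False: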

print(series.loc[~series.index.duplicated()])
5575    21010210
5577    21010210
5583    21010210
5586    21010210
8545    21010210
8718    21000102
8721    21000102
Name: CC, dtype: int64
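
Index.duplicated() accepts a keep argument ("first" by default, or "last"), and if only the distinct labels are needed for slicing other objects, Index.unique() returns them directly. A minimal sketch, assuming a hypothetical Series s and a hypothetical DataFrame other that shares some of its labels:

```python
import pandas as pd

# Hypothetical data: each label appears twice with different values.
s = pd.Series([1, 2, 3, 4], index=[10, 10, 20, 20], name="CC")

# Keep the first or the last row for each index label.
first = s[~s.index.duplicated(keep="first")]
last = s[~s.index.duplicated(keep="last")]

print(first.tolist())  # [1, 3]
print(last.tolist())   # [2, 4]

# Distinct labels, in order of first appearance.
labels = s.index.unique()

# Hypothetical second object sliced with those labels.
other = pd.DataFrame({"x": [100, 200, 300]}, index=[10, 20, 30])
subset = other.loc[labels]

print(subset["x"].tolist())  # [100, 200]
```

Selecting with .loc assumes every label in labels exists in other; a missing label raises a KeyError.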
