
Pandas drop_duplicates drops too many rows

I have a dataset of liked and unliked songs: 8764 liked and 2213 unliked, 11000 rows in total. There are many duplicate liked songs, but I expected at most around 2000-5000 duplicates, and I'm pretty sure there are no duplicate unliked songs. However, when I drop duplicate rows with the same track_name, first_artist, duration_ms combination, 10904 rows are dropped and only 196 rows are left, and the resulting dataset starts from around the 8700th row. Where am I going wrong?

import pandas as pd
data = pd.read_csv('data 1.csv')

# Number of rows before dropping duplicates
print(len(data)) # 11000

# Number of duplicate rows
print(len(data.loc[data.duplicated(subset=['track_name', 'first_artist', 'duration_ms'])]['track_name'])) # 10904

# Dropping the duplicate tracks
data.drop_duplicates(subset=['track_name', 'first_artist', 'duration_ms'], keep='last', inplace=True)

# Number of unique rows
print(len(data)) # 196

Can you find and provide some examples of rows that you expect to remain but don't (as provided DataFrames, not a screenshot)? I tested your code and it appeared to work for me.

import pandas as pd

# Minimal example with one duplicated (Track_Name, Artist, Duration_MS) combination
data = {
    'Artist' : ['An Artist', 'Another Artist', 'Last Artist', 'An Artist'],
    'Track_Name' : ['A Track', 'Another Track', 'Last Track', 'A Track'],
    'Duration_MS' : [1000, 2000, 3000, 1000],
    'Disliked_Artist' : ['A Disliked Artist', 'Another Disliked Artist', 'Last Disliked Artist', 'A Different Disliked Artist']
}
df = pd.DataFrame(data)
df.drop_duplicates(subset=['Track_Name', 'Artist', 'Duration_MS'], keep='last')
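For what it's worth, that toy example behaves exactly as documented: only the earlier of the two rows sharing ('A Track', 'An Artist', 1000) is dropped. A quick sanity check on the same df:

result = df.drop_duplicates(subset=['Track_Name', 'Artist', 'Duration_MS'], keep='last')
print(len(df), len(result))   # 4 3 -> only the row at index 0 is dropped
print(result.index.tolist())  # [1, 2, 3]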

So more information might help resolve any doubts or questions you might have.
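If it helps, one way to pull concrete examples out of your real file would be something like the sketch below (the column names and the 'data 1.csv' file name are assumed from your question):

import pandas as pd

data = pd.read_csv('data 1.csv')
keys = ['track_name', 'first_artist', 'duration_ms']

# How often each key combination occurs; anything above 1 is a genuine duplicate group.
counts = data.groupby(keys, dropna=False).size().sort_values(ascending=False)
print(counts.head(20))

# Every row that belongs to a duplicated group, so specific examples can be shared.
dupes = data[data.duplicated(subset=keys, keep=False)].sort_values(keys)
print(dupes.head(20))

If counts shows a handful of combinations with hundreds of occurrences each, then drop_duplicates is doing exactly what it was asked to do and the surprise is in the data, not the code.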
