
Python pandas drop_duplicates() inaccuracy

I am working on a project that involves compiling several .tsv files. I am attempting to clean up one of the files, and this is what I have so far.

The data file is far too large to paste the output here, so here are a couple of screenshots illustrating my current issue.

Before running drop_duplicates (screenshot: trying to remove the duplicate tconst values)

After running drop_duplicates (screenshot: far too many rows removed)


import pandas as pd

# Load the tab-separated akas.tsv file
origin = pd.read_table('akas.tsv')

# Drop the unneeded columns by position, then rename the survivors
origin.drop(origin.columns[[1, 2, 5, 6, 7]], axis=1, inplace=True)
origin.columns = ['tconst', 'region', 'language']

# Attempt to remove rows with duplicate tconst values
origin.drop_duplicates(subset='tconst', keep=False, inplace=True)
print(origin)

If you want to keep one record from each set of duplicates (instead of dropping all of them), you should not use keep=False. Citing the documentation for drop_duplicates:

keep : {'first', 'last', False}, default 'first'
Determines which duplicates (if any) to keep.

first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.

By specifying keep=False as you have, you're instructing pandas to drop every row whose tconst appears more than once. If, instead, you specify keep='first', your DataFrame will retain the first occurrence of each duplicate and drop the rest (which is what it seems you're expecting).
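As a quick illustration, here is a minimal sketch on a toy DataFrame (the tconst values are made up for the example) contrasting the two options:

import pandas as pd

df = pd.DataFrame({'tconst': ['tt001', 'tt001', 'tt002'],
                   'region': ['US', 'GB', 'US']})

# keep='first' keeps one row per tconst: tt001/US and tt002/US remain
print(df.drop_duplicates(subset='tconst', keep='first'))

# keep=False drops every row whose tconst is duplicated: only tt002/US remains
print(df.drop_duplicates(subset='tconst', keep=False))

Applied to your code, the fix is simply to pass keep='first' (or omit keep entirely, since 'first' is the default) in the drop_duplicates call.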
