
Python pandas drop_duplicates() inaccuracy

I am working on a project that involves compiling several .tsv files. I am attempting to clean up one of the files, and this is what I have so far.

The data file is far too large to paste the output here, so here are a couple of screenshots illustrating my current issue.

Before running drop_duplicates (screenshot: trying to remove the duplicate tconst values)

After running drop_duplicates (screenshot: far too many rows removed)


import pandas as pd

# Load the tab-separated akas.tsv file
origin = pd.read_table('akas.tsv')

# Drop the unneeded columns by position, then rename the survivors
origin.drop(origin.columns[[1, 2, 5, 6, 7]], axis=1, inplace=True)
origin.columns = ['tconst', 'region', 'language']

# Attempt to remove rows with duplicate tconst values
origin.drop_duplicates(subset='tconst', keep=False, inplace=True)
print(origin)

If you want to keep one record from each set of duplicates (instead of dropping all of them), you should not use keep=False. Citing the documentation for drop_duplicates:

keep : {'first', 'last', False}, default 'first'
Determines which duplicates (if any) to keep.

first : Drop duplicates except for the first occurrence.
last : Drop duplicates except for the last occurrence.
False : Drop all duplicates.

By specifying keep=False as you have, you're instructing pandas to drop every row whose tconst appears more than once. If, instead, you specify keep='first', your DataFrame will retain the first occurrence of each duplicate and drop the rest (which is what it seems you're expecting).
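As a quick illustration, here is a minimal sketch on a toy DataFrame (the tconst values are made up for the example) contrasting the two options:

import pandas as pd

df = pd.DataFrame({'tconst': ['tt001', 'tt001', 'tt002'],
                   'region': ['US', 'GB', 'US']})

# keep='first' keeps one row per tconst: tt001/US and tt002/US remain
print(df.drop_duplicates(subset='tconst', keep='first'))

# keep=False drops every row whose tconst is duplicated: only tt002/US remains
print(df.drop_duplicates(subset='tconst', keep=False))

Applied to your code, the fix is simply to pass keep='first' (or omit keep entirely, since 'first' is the default) in the drop_duplicates call.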
