
Remove duplicates in pandas: copy() and drop_duplicates() are removing rows that appear only once

As the question states, I am trying to get rid of duplicate rows in a DataFrame with two columns, df[['Offering Family', 'Major Offering']].

I plan to merge the resulting DataFrame with another one on the Major Offering column, so that only the Offering Family column is carried over to the new DataFrame. Note that I only want to drop rows whose values are repeated in both columns. If a value appears more than once in the Offering Family column but the value in the Major Offering column differs, the row should be kept. However, when I run the code below, I lose rows of that kind. Can anybody help?

df = pd.read_excel(pipelineEx, sheet_name='Data')

dfMO = df[['Offering Family', 'Major Offering']].copy()

dfMO.filter(['Offering Family', 'Major Offering'])

dfMO = df.drop_duplicates(subset=None, keep="first", inplace=False)


#dfMO.drop_duplicates(keep=False,inplace=True)
print(dfMO)

dfMO.to_excel("Major Offering.xlsx")

Well, there are a few odd things in the code you've shared.

Primarily, you created dfMO as a copy of df with only the two columns. But then you call drop_duplicates() on df, the original DataFrame, and overwrite the dfMO you created.
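To make the difference concrete, here is a minimal sketch on made-up data. The values and the extra "Amount" column are assumptions standing in for whatever other columns the real spreadsheet has. It compares calling drop_duplicates() on the original frame with calling it on the two-column copy.

```python
import pandas as pd

# Hypothetical data; "Amount" is invented to represent the other
# columns present in the original spreadsheet.
df = pd.DataFrame({
    "Offering Family": ["Cloud", "Cloud", "Cloud", "Security"],
    "Major Offering": ["Storage", "Storage", "Compute", "Firewall"],
    "Amount": [10, 20, 30, 40],
})

dfMO = df[["Offering Family", "Major Offering"]].copy()

# Called on df: all three columns are compared, so no row counts as a
# duplicate here, and the result still carries every column.
from_original = df.drop_duplicates(subset=None, keep="first", inplace=False)

# Called on the copy: only the two columns of interest are compared.
from_copy = dfMO.drop_duplicates(keep="first")

print(from_original.shape)  # (4, 3)
print(from_copy.shape)      # (3, 2)
```

With this data, the call on df returns all four rows and all three columns, while the call on the copy keeps one row per unique pair.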

From what I understand, you need the DataFrame to retain every unique combination of values in the two columns. groupby() would be better suited to that.

Try this:

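cols = ['Offering Family', 'Major Offering']
# Group on both columns so each unique pair appears once,
# then turn the group keys back into ordinary columns.
dfMO = df[cols].groupby(cols).count().reset_index()

reset_index() returns a new DataFrame by default, so no additional keyword arguments are necessary.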
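As a rough check, the sketch below runs a groupby variant next to drop_duplicates() on made-up data (the values are assumptions). It uses size() rather than count(), and the "rows" column name is chosen only for illustration, so that the result also records how often each pair occurred. Both routes should yield the same set of pairs, though groupby() sorts by the keys and, by default, leaves out rows whose key is missing.

```python
import pandas as pd

cols = ["Offering Family", "Major Offering"]

# Hypothetical data for illustration only.
df = pd.DataFrame({
    "Offering Family": ["Security", "Cloud", "Cloud", "Cloud"],
    "Major Offering": ["Firewall", "Storage", "Storage", "Compute"],
})

# groupby route: one row per unique pair, sorted by the group keys.
# size() counts the rows in each group; "rows" is an illustrative name.
via_groupby = df[cols].groupby(cols).size().reset_index(name="rows")

# drop_duplicates route: first occurrence of each pair, original order kept.
via_drop = df[cols].drop_duplicates().reset_index(drop=True)

print(via_groupby)
print(via_drop)
```

If the original row order matters, or if the key columns can contain missing values, drop_duplicates() is probably the safer of the two.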

I have updated your code. As Aditya Chhabra mentioned, you were creating a copy and then not using it.

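import pandas as pd

# pipelineEx is assumed to hold the path to the workbook, as in the question.
df = pd.read_excel(pipelineEx, sheet_name='Data')

# Work on a copy of just the two columns, then deduplicate that copy.
dfMO = df[['Offering Family', 'Major Offering']].copy()
dfMO.drop_duplicates(inplace=True)
print(dfMO)

dfMO.to_excel("Major Offering.xlsx")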
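For the merge step mentioned in the question, something along these lines might work. This is only a sketch: the second frame, called "other" here, and its "Deal" column are hypothetical, as are all the values. The validate argument makes pandas raise an error if a Major Offering maps to more than one Offering Family, which would otherwise silently duplicate rows in the merged result.

```python
import pandas as pd

# Deduplicated lookup table, built from made-up data.
dfMO = pd.DataFrame({
    "Offering Family": ["Cloud", "Cloud", "Cloud", "Security"],
    "Major Offering": ["Storage", "Storage", "Compute", "Firewall"],
}).drop_duplicates()

# Hypothetical frame to enrich; "Deal" is an invented column.
other = pd.DataFrame({
    "Major Offering": ["Compute", "Firewall", "Storage", "Storage"],
    "Deal": ["A", "B", "C", "D"],
})

# Left merge keeps every row of "other" and adds the Offering Family.
# validate="many_to_one" checks that each Major Offering occurs only
# once in dfMO, and raises MergeError otherwise.
merged = other.merge(dfMO, on="Major Offering", how="left",
                     validate="many_to_one")
print(merged)
```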

