
Pandas: How do I find the duplicates in one column and keep only the row with the highest value in the corresponding columns?

I want to find all of the duplicates in the first column, Primary Mod Site, and keep only the highest value for each of the compound columns (columns B-M) in the dataset. [screenshot of the Excel sheet]

For code, I have:

import pandas as pd

# read the desired Excel file
df = pd.read_excel("20220825_CISLIB01_Plate-1_Rows-A-B")

# finds the duplicates in the dataset and returns them as one DataFrame
# can be applied to any dataset with the same format as the original Excel files
def getDuplicate(df):
    # collect every group of rows that shares a "Primary Mod Site" value
    return pd.concat(g for _, g in df.groupby("Primary Mod Site") if len(g) > 1)

I'm stuck on what to do next. Help is much appreciated!
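For context, here is what the duplicate-finding step should produce on a tiny made-up frame with the same layout as the sheet (`ONCV-1-1-2` and all the values are hypothetical, invented for illustration); `duplicated(keep=False)` is an equivalent, shorter route than the `groupby`/`concat` version above:

```python
import pandas as pd

# Made-up data mimicking the sheet's layout; the column names other than
# "Primary Mod Site" and "ONCV-1-1-1", and all values, are hypothetical.
df = pd.DataFrame({
    "Primary Mod Site": ["K4", "K4", "R8", "K9", "K9"],
    "ONCV-1-1-1": [10, 25, 7, 3, 12],
    "ONCV-1-1-2": [5, 30, 2, 8, 6],
})

# every row whose "Primary Mod Site" value occurs more than once
dups = df[df.duplicated("Primary Mod Site", keep=False)]
print(dups)
```

Here `dups` contains the two K4 rows and the two K9 rows, while the unique R8 row is dropped.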

It helps if you post the data as code or text, so that others can reproduce the problem.

But, IIUC, you need to group by the first column ('Primary Mod Site') and then take the max of the rest of the columns; something like this seems to do the trick:

df.groupby("Primary Mod Site").max()

Based on what I noticed in the screenshot (the first 3 rows, for example), the row with the highest values tends to have the highest value in all columns, so something like this might work:

df = df.sort_values("ONCV-1-1-1", ascending=False).drop_duplicates("Primary Mod Site", keep="first", ignore_index=True)
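On made-up data (only the `Primary Mod Site` and `ONCV-1-1-1` column names come from the question; the rest is invented), this keeps, per site, the whole row that has the largest `ONCV-1-1-1` value:

```python
import pandas as pd

# Hypothetical example frame; the values are made up for illustration.
df = pd.DataFrame({
    "Primary Mod Site": ["K4", "K4", "R8", "K9", "K9"],
    "ONCV-1-1-1": [10, 25, 7, 3, 12],
    "ONCV-1-1-2": [5, 30, 2, 8, 6],
})

# sort so the biggest ONCV-1-1-1 comes first, then keep the first
# (i.e. largest) row per "Primary Mod Site"
deduped = (df.sort_values("ONCV-1-1-1", ascending=False)
             .drop_duplicates("Primary Mod Site", keep="first",
                              ignore_index=True))
print(deduped)
```

Note that this keeps each surviving row intact: for K9 it keeps the whole `(12, 6)` row, never mixing values from different rows.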

Or, if you are not sure that observation holds for every row, this would probably work:

df = df.groupby("Primary Mod Site").max()
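On the same kind of made-up data (hypothetical values, invented for illustration), `groupby(...).max()` takes the maximum of each compound column independently, so the result can combine values from different rows, unlike the `sort_values` + `drop_duplicates` approach:

```python
import pandas as pd

# Hypothetical example frame; only the layout mirrors the screenshot.
df = pd.DataFrame({
    "Primary Mod Site": ["K4", "K4", "R8", "K9", "K9"],
    "ONCV-1-1-1": [10, 25, 7, 3, 12],
    "ONCV-1-1-2": [5, 30, 2, 8, 6],
})

# column-wise maximum per site; as_index=False keeps
# "Primary Mod Site" as a regular column instead of the index
best = df.groupby("Primary Mod Site", as_index=False).max()
print(best)
# K9's maxima (12 and 8) come from two different original rows
```

Without `as_index=False` (as in the snippet above), `Primary Mod Site` becomes the index of the result; call `.reset_index()` if you want it back as a column.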

NB: please post a reproducible example that is easy to copy-paste, so we can test against it.
