
Deleting a large amount of data from a pandas dataframe

I have highly unbalanced data (binary labels: zeros are 96% of the data, ones are only 4%). To balance it, I decided to delete some of the rows with label zero. However, iterating over the whole dataframe and deleting rows one at a time with pandas.DataFrame.drop() would take several hours. What is the most time-efficient way to delete the data?

I have tried sorting the data and then clearing out a bunch of rows with label 0, but unfortunately I must not change the order of the data.

I have selected the indexes of rows with label 0 and chosen random indexes from that list to delete, like so: drops = random.sample(zero_indexes, X) (where X is the number of rows I want to delete), but I am not sure how to delete the rows at those indexes in acceptable time. Any help would be appreciated.
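For what it's worth, the slow part is usually calling drop() once per row inside a loop; passing the whole list of sampled indexes to a single drop() call is one vectorized operation and preserves the order of the remaining rows. A minimal sketch, reusing the question's zero_indexes, X, and drops names on toy data:

```python
import random
import pandas as pd

# Toy unbalanced frame standing in for the real data: 96 zeros, 4 ones
df = pd.DataFrame({"label": [0] * 96 + [1] * 4, "value": range(100)})

zero_indexes = df.index[df["label"] == 0].tolist()
X = 50  # number of label-0 rows to delete (arbitrary for this sketch)

drops = random.sample(zero_indexes, X)
df = df.drop(drops)  # one vectorized call instead of a per-row loop

print(len(df))  # 50 rows remain, still in their original order
```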

Get a list of indices you want to chuck:

bad_labels = df[df['label'] == 0].sample(500).index

Then filter df down to the rows not in that list:

df1 = df[~df.index.isin(bad_labels)]
