Remove specific set of rows from each group in a dataframe

I have a dataframe as follows:

df = pd.DataFrame({"user_id": ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
                   "value": [20, 17,15, 10, 8 , 18, 18, 17, 13, 10]})

Notice that the dataframe is sorted by user_id, and within each user_id by value in descending order.

For each user_id, I would like to remove the 2nd and 4th rows, so that the output looks like:

df = pd.DataFrame({"user_id": ['a', 'a', 'a', 'b', 'b', 'b',],
                   "value": [20, 15, 8 , 18, 17, 10]})

Inspired by "drop first and last row from within each group", I tried the following:

def drop_rows(dataframe) : 
     pos = [1,3]
     return dataframe.drop(dataframe.index[pos], inplace=True)
df.groupby('user_id').apply(drop_rows)

But I got this error: "index 2 is out of bounds for axis 0 with size 0"

Could someone explain why this doesn't work and how I should proceed instead? Also, given that the dataset is quite huge, an efficient approach to the solution would be helpful. Thanks a lot.

You can use groupby + cumcount to get each row's position within its group (add(1) makes the count 1-based), then keep only the rows whose position is not in the to_del list:

to_del = [2,4]
df[~df.groupby('user_id').cumcount().add(1).isin(to_del)]

  user_id  value
0       a     20
2       a     15
4       a      8
5       b     18
7       b     17
9       b     10
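For reference, here is a self-contained sketch of the same cumcount approach, wrapped in a small helper and checked against the output the question asks for. The helper name `drop_positions` and its parameters are made up for illustration; they are not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ["a", "a", "a", "a", "a", "b", "b", "b", "b", "b"],
    "value": [20, 17, 15, 10, 8, 18, 18, 17, 13, 10],
})


def drop_positions(frame, key, positions):
    """Drop rows at the given 1-based positions within each group of `key`.

    Hypothetical helper for illustration only.
    """
    # cumcount() numbers the rows 0, 1, 2, ... inside each group,
    # following the current row order; adding 1 makes it 1-based.
    pos_in_group = frame.groupby(key).cumcount() + 1
    # Keep only rows whose in-group position is not listed.
    return frame[~pos_in_group.isin(positions)]


result = drop_positions(df, "user_id", [2, 4])

# Desired output from the question, for comparison.
expected = pd.DataFrame({
    "user_id": ["a", "a", "a", "b", "b", "b"],
    "value": [20, 15, 8, 18, 17, 10],
})

print(result.reset_index(drop=True).equals(expected))
```

This builds one boolean mask over the whole frame rather than calling a Python function once per group, which is likely to matter on a large dataset. As for the original attempt, one thing that is certain is that `drop(..., inplace=True)` returns `None`, so `drop_rows` hands nothing back to `apply`; modifying the group in place inside `apply` is also best avoided.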

