I have a dataframe as follows:
df = pd.DataFrame({"user_id": ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
                   "value": [20, 17, 15, 10, 8, 18, 18, 17, 13, 10]})
Notice that the dataframe is sorted by user_id and, within each user_id, by value in descending order.
For each user_id, I would like to remove the 2nd and 4th rows, so the output would look like:
df = pd.DataFrame({"user_id": ['a', 'a', 'a', 'b', 'b', 'b'],
                   "value": [20, 15, 8, 18, 17, 10]})
Inspired by "drop first and last row from within each group", I tried the following:
def drop_rows(dataframe):
    pos = [1, 3]
    return dataframe.drop(dataframe.index[pos], inplace=True)

df.groupby('user_id').apply(drop_rows)
But I got this error: "index 2 is out of bounds for axis 0 with size 0"
Could someone explain why this doesn't work and how I should proceed instead? Also, given that the dataset is quite huge, an efficient approach to the solution would be helpful. Thanks a lot.
You can use groupby + cumcount to number the rows within each group. cumcount starts at 0, so add 1 to get 1-based positions, then keep only the rows whose position is not in the to_del list:
to_del = [2,4]
df[~df.groupby('user_id').cumcount().add(1).isin(to_del)]
user_id value
0 a 20
2 a 15
4 a 8
5 b 18
7 b 17
9 b 10
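For reference, here is a self-contained sketch of the same cumcount approach that can be run as-is. The helper name drop_positions and its parameters are made up for illustration and are not part of pandas. On the "why" part of the question: drop(..., inplace=True) returns None, so drop_rows hands nothing back to apply, and modifying a group in place inside apply is generally not reliable. That is probably at least part of the reason the original attempt fails.

```python
import pandas as pd

df = pd.DataFrame({
    "user_id": ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'b'],
    "value": [20, 17, 15, 10, 8, 18, 18, 17, 13, 10],
})


def drop_positions(frame, key, positions):
    """Drop rows at the given 1-based positions within each group of `key`.

    Hypothetical helper for illustration only.
    """
    # cumcount() numbers rows 0, 1, 2, ... within each group;
    # adding 1 turns that into 1-based positions.
    pos_in_group = frame.groupby(key).cumcount() + 1
    # Keep the rows whose position is NOT in the list to delete.
    return frame[~pos_in_group.isin(positions)]


result = drop_positions(df, "user_id", [2, 4])
print(result)
```

This builds a single boolean mask over the whole frame instead of calling a Python function once per group, which should scale better on a large dataset. The original index labels are kept; chain .reset_index(drop=True) if a fresh 0..n-1 index is wanted.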