Keep X% last rows by group in Pandas

Question

It's straightforward to keep the last N rows for every group in a dataframe with something like df.groupby('ID').tail(N) .

In my case, groups have different sizes and I would like to keep the same % of each group rather than same number of rows.

eg if we want to keep the last 50% rows for each group (based on ID) for the following:

df = pd.DataFrame({'ID' : ['A','A','B','B','B','B','B','B'], 
'value' : [1,2,10,11,12,13,14,15]})

The result would be:

 pd.DataFrame({'ID' : ['A','A','B','B','B','B','B','B'], 
    'value' : [2,13,14,15]})

How can we get to that?

EDIT: If x% is not an int, we round to the smallest closer int.

Answer 1

`groupby` - `apply` - `tail`

Pass the desired size to tail() in a GroupBy.apply() . This is simpler than the iloc method below since it cleanly handles the "last 0 rows" case.

ratio = 0.6
(df.groupby('ID')
   .apply(lambda x: x.tail(int(ratio * len(x))))
   .reset_index(drop=True))

#   ID  value
# 0  A      2
# 1  B     13
# 2  B     14
# 3  B     15

ratio = 0.4
(df.groupby('ID')
   .apply(lambda x: x.tail(int(ratio * len(x))))
   .reset_index(drop=True))

#   ID  value
# 0  B     14
# 1  B     15

`groupby` - `apply` - `iloc`

Alternatively, index the desired size via iloc /slicing, but this is clunkier since [-0:] does not actually get the last 0 rows, so we have to check against that:

ratio = 0.6
(df.groupby('ID')
   .apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
   .reset_index(drop=True))

#   ID  value
# 0  A      2
# 1  B     13
# 2  B     14
# 3  B     15

ratio = 0.4
(df.groupby('ID')
   .apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
   .reset_index(drop=True))

#   ID  value
# 0  B     14
# 1  B     15

Answer 2

Like commented, there is no built-in option to do so. You can do something like:

groups = df.groupby('ID')

enums = groups.cumcount().add(1)
sizes = groups['ID'].transform('size')

df[enums/sizes > 0.5]

Output:

  ID  value
1  A      2
5  B     13
6  B     14
7  B     15

Answer 3

Let us try two steps

s = df.groupby('ID').size()
out = df.groupby('ID').apply(lambda x : x.tail(s.loc[x.name]//2)).reset_index(drop=True)
out
Out[210]: 
  ID  value
0  A      2
1  B     13
2  B     14
3  B     15

Keep X% last rows by group in Pandas

Question

3 answers

solution1
9 ACCPTED 2021-03-10 03:32:14

`groupby` - `apply` - `tail`

`groupby` - `apply` - `iloc`

solution2
8 2021-03-10 03:30:54

solution3
5 2021-03-10 03:45:08

Keep X% last rows by group in Pandas

Question

3 answers

solution1 9 ACCPTED 2021-03-10 03:32:14

groupby - apply - tail

groupby - apply - iloc

solution2 8 2021-03-10 03:30:54

solution3 5 2021-03-10 03:45:08

solution1
9 ACCPTED 2021-03-10 03:32:14

`groupby` - `apply` - `tail`

`groupby` - `apply` - `iloc`

solution2
8 2021-03-10 03:30:54

solution3
5 2021-03-10 03:45:08