简体   繁体   中英

Keep X% last rows by group in Pandas

It's straightforward to keep the last N rows for every group in a dataframe with something like df.groupby('ID').tail(N) .

In my case, groups have different sizes and I would like to keep the same % of each group rather than same number of rows.

eg if we want to keep the last 50% rows for each group (based on ID) for the following:

df = pd.DataFrame({'ID' : ['A','A','B','B','B','B','B','B'], 
'value' : [1,2,10,11,12,13,14,15]})

The result would be:

 pd.DataFrame({'ID' : ['A','A','B','B','B','B','B','B'], 
    'value' : [2,13,14,15]})

How can we get to that?

EDIT: If x% is not an int, we round to the smallest closer int.

groupby - apply - tail

Pass the desired size to tail() in a GroupBy.apply() . This is simpler than the iloc method below since it cleanly handles the "last 0 rows" case.

ratio = 0.6
(df.groupby('ID')
   .apply(lambda x: x.tail(int(ratio * len(x))))
   .reset_index(drop=True))

#   ID  value
# 0  A      2
# 1  B     13
# 2  B     14
# 3  B     15
ratio = 0.4
(df.groupby('ID')
   .apply(lambda x: x.tail(int(ratio * len(x))))
   .reset_index(drop=True))

#   ID  value
# 0  B     14
# 1  B     15

groupby - apply - iloc

Alternatively, index the desired size via iloc /slicing, but this is clunkier since [-0:] does not actually get the last 0 rows, so we have to check against that:

ratio = 0.6
(df.groupby('ID')
   .apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
   .reset_index(drop=True))

#   ID  value
# 0  A      2
# 1  B     13
# 2  B     14
# 3  B     15
ratio = 0.4
(df.groupby('ID')
   .apply(lambda x: x[-int(ratio * len(x)):] if int(ratio * len(x)) else None)
   .reset_index(drop=True))

#   ID  value
# 0  B     14
# 1  B     15

Like commented, there is no built-in option to do so. You can do something like:

groups = df.groupby('ID')

enums = groups.cumcount().add(1)
sizes = groups['ID'].transform('size')

df[enums/sizes > 0.5]

Output:

  ID  value
1  A      2
5  B     13
6  B     14
7  B     15

Let us try two steps

s = df.groupby('ID').size()
out = df.groupby('ID').apply(lambda x : x.tail(s.loc[x.name]//2)).reset_index(drop=True)
out
Out[210]: 
  ID  value
0  A      2
1  B     13
2  B     14
3  B     15

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM