简体   繁体   中英

Keep only first five rows if group has more than five rows

I have a dataframe with USERID (thus we group by this), and other variables statuses and a date .

Some of these USERID's have more than 5 statuses, so we should keep only the 5 most recent ones,by date .

How should I code this, looks simple but I haven't manage to do so.

We can use data.table . Convert the 'data.frame' to 'data.table' ( setDT(df1) ), grouped by 'USERID', we order the 'date' in decreasing (Assuming that the 'date' column is Date class) and get the first 5 rows with head

library(data.table)
setDT(df1)[order(-date), head(.SD, 5), by=USERID]

Or as @Symbolix mentioned in the comments, we could also make use of .I to get the row index and later remove the NA rows for groups that don't have 5 rows)

 na.omit(setDT(df1)[df1[order(-date), .I[1:5], by= USERID]$V1 ])

data

set.seed(49)
df1 <- data.frame(USERID= sample(LETTERS[1:3], 12, 
  replace=TRUE), date= sample(seq(as.Date('2014-01-01'), 
  length.out=12, by = '1 day')))

If you're a fan of dplyr you can do

library(dplyr)

df %>%
  group_by(USERID) %>%
  arrange(-date) %>%
  slice(1:5) %>%
  ungroup

On 'large' data sets the data.table approach will likely be faster, but dplyr has a slightly easier syntax to get your head around at first (in my opinion).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM