dplyr: select first half (or given proportion) of each group

My need is simple: I have a data.frame with a grouping variable, like this:

library(dplyr)
proportion = 0.5; set.seed(1)
df = data.frame(id=1:6, name=c("a", "a", "b"), value=rnorm(6)) %>% arrange(name)

I want to keep only the first half of each group (when ordered by id). Ideally the proportion would be adjustable rather than fixed at one half, for example 0.65, because this is for splitting data into train and test sets.

Many questions address this, but only for a fixed number of rows (using top_n()). I don't know how to make the number of rows depend on the size of each group using dplyr. I also don't want sample_frac(), because it would break the id order. I have, however, found a two-step solution using a custom function:

myfunc = function(data, prop){head(data, nrow(data)*prop)}
splitted.data = split(df, df$name)
lapply(splitted.data, myfunc, prop=proportion) %>% bind_rows()
####   id name      value
#### 1  1    a -0.6264538
#### 2  2    a  0.1836433
#### 3  3    b -0.8356286

But can I do this with dplyr directly? Thanks

You can use n(), which gives the number of rows in the current group. It doesn't work inside top_n(), but it works inside filter() and slice():

df %>% 
  group_by(name) %>% 
  filter(row_number() <= proportion * n())

or

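df %>% 
  group_by(name) %>% 
  # seq_len() returns an empty index when proportion * n() is below 1,
  # whereas seq() would count down from 1 and still keep the first row
  slice(seq_len(floor(proportion * n())))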
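
Since the stated goal is a train/test split, here is a minimal sketch of how the filter() approach could be extended to produce both parts. It rebuilds the example data from the question; the names train and test are illustrative, and using anti_join() on id assumes that id uniquely identifies each row.

```r
library(dplyr)

set.seed(1)
proportion <- 0.5
df <- data.frame(id = 1:6, name = c("a", "a", "b"), value = rnorm(6))

# Training set: the first `proportion` of each group, ordered by id.
train <- df %>%
  group_by(name) %>%
  arrange(id, .by_group = TRUE) %>%
  filter(row_number() <= proportion * n()) %>%
  ungroup()

# Test set: every row whose id is not in the training set
# (assumes id is unique across the whole data frame).
test <- anti_join(df, train, by = "id")
```

With the example data, train should hold ids 1, 2 and 3, and test ids 4, 5 and 6. If your dplyr version is 1.0.0 or later, slice_head(prop = proportion) on the grouped data is likely a shorter way to get the same training rows, since it also rounds the row count down.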


 