I have a data frame with two grouping columns, V1 and V2. I want to sample exactly n = 4 rows for each distinct value of V1, while making sure that at least m = 1 row of each distinct value of V2 is sampled.
library(tidyverse)
set.seed(1)
df = data.frame(
  V1 = c(rep("A",6), rep("B",6)),
  V2 = c("C","C","D","D","E","E","F","F","G","G","H","H"),
  V3 = rnorm(12)
)
df
   V1 V2         V3
1   A  C -0.6264538
2   A  C  0.1836433
3   A  D -0.8356286
4   A  D  1.5952808
5   A  E  0.3295078
6   A  E -0.8204684
7   B  F  0.4874291
8   B  F  0.7383247
9   B  G  0.5757814
10  B  G -0.3053884
11  B  H  1.5117812
12  B  H  0.3898432
My desired output is, for example:
V1 V2 V3
1 A C -0.626
2 A D -0.836
3 A E -0.820
4 A E 0.329
5 B F 0.487
6 B G 0.576
7 B G -0.305
8 B H 0.390
I do not know how to generate this output. When I group by V1 and V2 and sample one row per group, I get only n = 3 rows for each distinct value of V1 (one per V2 value) instead of 4.
df %>%
  group_by(V1,V2) %>%
  sample_n(1)
V1 V2 V3
1 A C -0.626
2 A D -0.836
3 A E -0.820
4 B F 0.487
5 B G 0.576
6 B H 0.390
Neither the "splitstackshape" nor the "sampling" package helped.
Here is one approach:
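library(dplyr)

nr <- 4  # total number of rows wanted per V1

# First pass: one random row for each V1/V2 combination
first_pass <- df %>%
  group_by(V1, V2) %>%
  sample_n(1) %>%
  ungroup()

# Second pass: draw the rows still missing per V1, then combine
first_pass %>%
  count(V1) %>%
  mutate(n = nr - n) %>%          # rows still needed per V1
  left_join(df, by = 'V1') %>%
  group_by(V1) %>%
  sample_n(first(n)) %>%
  select(-n) %>%
  bind_rows(first_pass) %>%
  arrange(V1, V2)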
# V1 V2 V3
# <chr> <chr> <dbl>
#1 A C 0.184
#2 A D -0.836
#3 A E -0.820
#4 A E -0.820
#5 B F 0.487
#6 B F 0.738
#7 B G -0.305
#8 B H 0.390
The logic is to first randomly select 1 row for each combination of V1 and V2. We then calculate, for each V1, how many more rows we need to reach nr rows, sample that many rows at random from each V1, and combine both samples into the final dataset.
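One thing to note about the approach above: the second pass samples from all rows of df, so it can draw a row that the first pass already picked (the duplicated A/E row in the output shows this). If each row should appear at most once, a minimal sketch along the following lines excludes the first-pass rows before topping up. The row_id and n_first helper columns and the sample.int() shuffle are assumptions of this sketch, not part of the original answer.

```r
library(dplyr)

set.seed(1)
df <- data.frame(
  V1 = c(rep("A", 6), rep("B", 6)),
  V2 = c("C", "C", "D", "D", "E", "E", "F", "F", "G", "G", "H", "H"),
  V3 = rnorm(12)
)

nr <- 4  # total number of rows wanted per V1

# Tag every row with an id so rows that were already sampled can be excluded
# (row_id is a helper column introduced for this sketch).
df_id <- df %>% mutate(row_id = row_number())

# First pass: one random row for each V1/V2 combination.
first_pass <- df_id %>%
  group_by(V1, V2) %>%
  sample_n(1) %>%
  ungroup()

# Second pass: among the rows not chosen yet, keep a random subset per V1
# that tops the group up to nr rows. sample.int(n()) is a random permutation
# of 1..n(), so keeping values <= k selects k rows at random.
second_pass <- df_id %>%
  anti_join(first_pass, by = "row_id") %>%
  left_join(count(first_pass, V1, name = "n_first"), by = "V1") %>%
  group_by(V1) %>%
  filter(sample.int(n()) <= nr - n_first) %>%
  ungroup() %>%
  select(-n_first)

# Combine both passes; row_id is kept here so uniqueness can be checked.
picked <- bind_rows(first_pass, second_pass)

result <- picked %>%
  select(-row_id) %>%
  arrange(V1, V2)

result
```

If a V1 group has fewer than nr rows in total, this sketch returns all of its rows instead of raising an error, so it may be worth checking the per-group counts afterwards with count(result, V1).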