Combine random sample with replacement with values from original dataframe in R

I have a dataset that looks like the following:

group  y  x
1      2  0
1      3  0
1      1  0
2      3  1
2      4  1
2      3  1

In the actual dataset, there are 180 groups (though they're not numbered from 1-180). The value of x is either 0 or 1 and is the same within each group. The value of y differs for each individual observation.

I am trying to get a random sample with replacement from the group column. Then, I would like to find a way to combine this with the original data. For example, if I randomly sample the group 1, I would like the final dataset to include all 3 observations included in group 1. If I randomly sample group 1 twice, I would like the final dataset to include each observation from group 1 twice.

Here's an example. Suppose I randomly sample groups 1, 1, and 2. I would like the final dataset to look like this:

group  y  x
1      2  0
1      3  0
1      1  0
1      2  0
1      3  0
1      1  0
2      3  1
2      4  1
2      3  1 

When I sample as shown below, I get a vector of group values. I am not sure what to do next to get the result I am looking for.

clusters <- sample(df$group, 180, replace = TRUE)

In Excel, I would use vlookup() to do something like this.

Base R:

set.seed(42)
do.call(rbind, sample(split(dat, dat$group), size = 3, replace = TRUE))
#      group y x
# 2.4      2 3 1
# 2.5      2 4 1
# 2.6      2 3 1
# 2.41     2 3 1
# 2.51     2 4 1
# 2.61     2 3 1
# 1.1      1 2 0
# 1.2      1 3 0
# 1.3      1 1 0

(The row names are not pretty, but they are harmless and ignored by most tools.)
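If the row names do get in the way later, one option is to reset them after binding. This is a minimal sketch, assuming the same `dat` as in the Data section at the end; `res` is only an illustrative name:

```r
# Example data (same values as the Data section of this answer)
dat <- data.frame(group = c(1L, 1L, 1L, 2L, 2L, 2L),
                  y     = c(2L, 3L, 1L, 3L, 4L, 3L),
                  x     = c(0L, 0L, 0L, 1L, 1L, 1L))

set.seed(42)
res <- do.call(rbind, sample(split(dat, dat$group), size = 3, replace = TRUE))

# Drop the "2.4", "2.41", ... labels and fall back to plain 1..n row names
rownames(res) <- NULL
```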

More generally, broken into steps:

dat_spl <- split(dat, dat$group)
inds <- c(1, 1, 2)
### randomly this can be done with:
# inds <- sample(length(dat_spl), size = 3, replace = TRUE)
do.call(rbind, dat_spl[inds])
#      group y x
# 1.1      1 2 0
# 1.2      1 3 0
# 1.3      1 1 0
# 1.11     1 2 0
# 1.21     1 3 0
# 1.31     1 1 0
# 2.4      2 3 1
# 2.5      2 4 1
# 2.6      2 3 1
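Since the real group labels are not numbered 1-180, it may be easier to sample the labels themselves and then look the rows up, which is roughly what vlookup() does in Excel. The sketch below assumes the same `dat` as in the Data section; `ids`, `draws`, and the `draw` column are hypothetical names introduced only for illustration:

```r
# Example data (same values as the Data section of this answer)
dat <- data.frame(group = c(1L, 1L, 1L, 2L, 2L, 2L),
                  y     = c(2L, 3L, 1L, 3L, 4L, 3L),
                  x     = c(0L, 0L, 0L, 1L, 1L, 1L))

# Sample from the distinct group labels rather than the row-level column,
# so each group is equally likely regardless of how many rows it has
set.seed(42)
ids <- sample(unique(dat$group), size = 3, replace = TRUE)

# One row per draw; `draw` (a hypothetical helper column) keeps repeated
# draws of the same group distinguishable
draws <- data.frame(draw = seq_along(ids), group = ids)

# Lookup step: each draw picks up every row of its group
res <- merge(draws, dat, by = "group")
res <- res[order(res$draw), ]
```

For the real data, `size = 180` would draw as many groups as the original sample contains. One caveat: if there were only a single numeric group label `n`, `sample()` would treat it as `1:n`, so this sketch assumes at least two groups.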

If you want/need it to be pure-tidyverse, an alternative:

library(dplyr)
library(tidyr)  # provides nest() and unnest()
set.seed(42)
dat %>%
  group_by(group) %>%
  nest(dat = -group) %>%
  ungroup() %>%
  sample_n(3, replace = TRUE) %>%
  unnest(dat)
# # A tibble: 9 x 3
#   group     y     x
#   <int> <int> <int>
# 1     2     3     1
# 2     2     4     1
# 3     2     3     1
# 4     2     3     1
# 5     2     4     1
# 6     2     3     1
# 7     1     2     0
# 8     1     3     0
# 9     1     1     0
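In dplyr 1.0.0 and later, `sample_n()` is superseded by `slice_sample()`. A possible variant of the pipeline above, assuming dplyr >= 1.0.0 and tidyr >= 1.0.0 are installed; the `draw` column is a hypothetical addition that records which draw each row came from:

```r
library(dplyr)
library(tidyr)  # nest() and unnest() live in tidyr

# Example data (same values as the Data section of this answer)
dat <- data.frame(group = c(1L, 1L, 1L, 2L, 2L, 2L),
                  y     = c(2L, 3L, 1L, 3L, 4L, 3L),
                  x     = c(0L, 0L, 0L, 1L, 1L, 1L))

set.seed(42)
res <- dat %>%
  nest(dat = -group) %>%                    # one row per group
  slice_sample(n = 3, replace = TRUE) %>%   # draw groups with replacement
  mutate(draw = row_number()) %>%           # label each draw before expanding
  unnest(dat)                               # back to one row per observation
```

Which groups get drawn depends on the seed and the R version, so no output is shown here.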

Data:

dat <- structure(list(group = c(1L, 1L, 1L, 2L, 2L, 2L), y = c(2L, 3L, 
1L, 3L, 4L, 3L), x = c(0L, 0L, 0L, 1L, 1L, 1L)), row.names = c(NA, 
-6L), class = "data.frame")
