
Conditionally sampling with slice_sample in dplyr

I had difficulty finding an answer to this, so I figured I would ask a new question. I am trying to figure out how to take a conditional random sample of a dataset. For simplicity, I have this data frame with one variable, food, which has three different levels: apple.1, apple.2, and banana. In the real scenario the levels are not so neatly ordered in the data frame and appear more randomly, but this is what I have so far:

df <- data.frame(food = rep(c("apple.1",
                              "apple.2",
                              "banana"),
                            500))
head(df,10)

Which gives me this if printed:

      food
1  apple.1
2  apple.2
3   banana
4  apple.1
5  apple.2
6   banana
7  apple.1
8  apple.2
9   banana
10 apple.1

Now sampling without replacement from there is easy enough with slice_sample:

library(dplyr)

df %>%
  slice_sample(n = 10)

Which gives me what I need on that front:

      food
1  apple.1
2   banana
3  apple.1
4  apple.2
5  apple.2
6  apple.2
7   banana
8  apple.2
9   banana
10 apple.1

However, let's say apple.1 and apple.2 come in pairs from a store, and we only want to pick one apple from each pair. If we pick both apples, it becomes less random due to age effects, environmental factors related to packaging, etc. So what I would like to do is make a conditional sample, where if I randomly pick fruit from a theoretical fruit basket, I am only selecting bananas and one of each pair of apples. So what can I do to accomplish this in R?

Edit

I wasn't as specific in my question as I probably should have been. For my specific query, I also need a way to uniquely identify which pair each apple comes from. So if Apple 1 and Apple 2 both come from Basket 67, I would like a way to uniquely identify that so I can check for duplicates.

I have included this very simple version of a dataset I'm thinking of:

df <- structure(list(Basket = c(1L, 1L, 2L, 3L, 3L, 4L, 5L, 5L, 6L, 
7L, 7L, 8L, 9L, 9L, 10L), Fruit = c("Apple.1", "Apple.2", "Banana", 
"Apple.1", "Apple.2", "Banana", "Apple.1", "Apple.2", "Banana", 
"Apple.1", "Apple.2", "Banana", "Apple.1", "Apple.1", "Banana"
)), class = "data.frame", row.names = c(NA, -15L))

Which looks like this:

   Basket   Fruit
1       1 Apple.1
2       1 Apple.2
3       2  Banana
4       3 Apple.1
5       3 Apple.2
6       4  Banana
7       5 Apple.1
8       5 Apple.2
9       6  Banana
10      7 Apple.1
11      7 Apple.2
12      8  Banana
13      9 Apple.1
14      9 Apple.1
15     10  Banana

You could first sample the baskets (here 5 of the 10) and then draw one fruit at random from each sampled basket, so that at most one apple from any pair is selected.

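set.seed(1)  # make the random draws reproducible

df %>%
  # keep only the rows belonging to 5 randomly chosen baskets
  filter(Basket %in% sample(unique(Basket), 5)) %>%
  # within each kept basket, draw a single fruit
  group_by(Basket) %>%
  slice_sample(n = 1) %>%
  ungroup()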

# # A tibble: 5 × 2
#   Basket Fruit  
#    <int> <chr>  
# 1      1 Apple.1
# 2      2 Banana 
# 3      4 Banana 
# 4      7 Apple.2
# 5      9 Apple.1
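
Since the question also asks for a way to check for duplicates, here is a minimal sketch of the same idea in the opposite order: keep one random row per basket first, then sample rows from what is left. It assumes the toy Basket/Fruit data from the question and dplyr 1.0.0 or later (for slice_sample). The name picked is only an illustrative placeholder, not something from the original post.

```r
library(dplyr)

# Toy data from the question; Basket identifies which pair an apple belongs to.
df <- data.frame(
  Basket = c(1L, 1L, 2L, 3L, 3L, 4L, 5L, 5L, 6L, 7L, 7L, 8L, 9L, 9L, 10L),
  Fruit  = c("Apple.1", "Apple.2", "Banana", "Apple.1", "Apple.2", "Banana",
             "Apple.1", "Apple.2", "Banana", "Apple.1", "Apple.2", "Banana",
             "Apple.1", "Apple.1", "Banana")
)

set.seed(1)

# Step 1: keep one randomly chosen row per basket.
# Step 2: sample 5 of the remaining rows without replacement.
# `picked` is a hypothetical name used only for this sketch.
picked <- df %>%
  group_by(Basket) %>%
  slice_sample(n = 1) %>%
  ungroup() %>%
  slice_sample(n = 5)

# Duplicate check: anyDuplicated() returns 0 when no basket repeats.
stopifnot(anyDuplicated(picked$Basket) == 0)
```

Either ordering should leave each basket represented at most once. Keeping the Basket column in the result is what makes the duplicate check possible afterwards.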
