
How to get 3 lists with no duplicates in a random sampling? (R)

I have already completed the first step, which was counting:

  • how many persons have more than 1 point
  • how many persons have more than 3 points
  • how many persons have more than 6 points

My goal: I need random samples, with no person appearing in more than one sample:

  • of 3 persons that have more than 1 point
  • of 3 persons that have more than 3 points
  • of 3 persons that have more than 6 points

My dataset looks like this:

id   person   points
201  rt99   NA
201  rt99   3
201  rt99   2
202  kt     4
202  kt     NA
202  kt     NA
203  rr     4
203  rr     NA
203  rr     NA
204  jk     2
204  jk     2
204  jk     NA
322  knm3   5
322  knm3   NA
322  knm3   3
343  kll2   2
343  kll2   1
343  kll2   5
344  kll    NA
344  kll    7
344  kll    1
345  nn     7
345  nn     NA
490  kk     1
490  kk     NA
490  kk     2
491  ww     1
491  ww     1
489  tt     1
489  tt     1
325  ll     1
325  ll     1
325  ll     NA
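
For anyone who wants to reproduce this, the table above can be entered as a data frame. This is only a sketch in base R; the helper vector `rows_per_person` and the `totals` object are names introduced here for illustration and are not part of my original code.

```r
# Number of rows each person has in the table, in table order
rows_per_person <- c(3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 2, 3)

# Rebuild the table; rep() expands each id/person to its row count
dataset <- data.frame(
  id     = rep(c(201, 202, 203, 204, 322, 343, 344, 345, 490, 491, 489, 325),
               times = rows_per_person),
  person = rep(c("rt99", "kt", "rr", "jk", "knm3", "kll2", "kll", "nn",
                 "kk", "ww", "tt", "ll"), times = rows_per_person),
  points = c(NA, 3, 2,  4, NA, NA,  4, NA, NA,  2, 2, NA,
             5, NA, 3,  2, 1, 5,  NA, 7, 1,  7, NA,
             1, NA, 2,  1, 1,  1, 1,  1, 1, NA),
  stringsAsFactors = FALSE
)

# Total points per person, ignoring NAs
totals <- tapply(dataset$points, dataset$person, sum, na.rm = TRUE)
```

With these totals, all 12 persons have more than 1 point, 8 have more than 3, and 4 have more than 6.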

This is what I have tried so far. Here is an example of the code for finding persons that have more than 1 point:

library(dplyr)

persons_filtered <- dataset %>%
  group_by(person) %>%
  # keep persons whose total points (ignoring NAs) exceed 1
  filter(sum(points, na.rm = TRUE) > 1) %>%
  distinct(person) %>%
  pull(person)
persons_filtered
more_than_1 <- sample(persons_filtered, size = 3)

Question: How can I write this code better so that I end up with 3 lists of unique persons? (I need to prevent the same person from appearing in more than one list.)

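Here's a tidyverse solution in which the sampling for the three categories of interest is done in a single step. Each person is assigned to exactly one range of total points, so nobody can end up in more than one sample.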

library(tidyverse)
dataset %>%
  # Group by person
  group_by(person) %>%
  # Get points sum
  summarize(sum_points = sum(points, na.rm = TRUE)) %>%
  # Classify the sum points into categories defined by breaks: (0,1], (1,3], (3,6], (6,Inf]
  # Inf is the last break so that every sum above 6 is classified as (6,Inf]
  mutate(point_class = cut(sum_points, breaks = c(0,1,3,6,Inf))) %>%
  # ungroup
  ungroup() %>%
  # group by point class
  group_by(point_class) %>%
  # Sample 3 rows per point_class
  sample_n(size = 3) %>%
  # Eliminate the sum_points column
  select(-sum_points) %>%
  # If you need this data in lists you can nest the results in the sampled_data column
  nest(sampled_data= -point_class)
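
Note that the pipeline above treats the categories as disjoint ranges, while the question words them as nested thresholds (someone with more than 6 points also has more than 3 and more than 1). If the nested reading is what you need, one possible approach is to sample the strictest threshold first and remove the persons already drawn before sampling the next one. The following is a minimal base R sketch under that assumption; `sample_disjoint` is a hypothetical helper name, and `totals` holds the per-person sums computed by hand from the table in the question.

```r
# Per-person totals (sum of points, NAs ignored), taken from the question's table
totals <- c(rt99 = 5, kt = 4, rr = 4, jk = 4, knm3 = 8, kll2 = 8,
            kll = 8, nn = 7, kk = 3, ww = 2, tt = 2, ll = 2)

# Hypothetical helper: draw n persons per threshold, strictest threshold first,
# excluding anyone already drawn so no person appears in two samples.
sample_disjoint <- function(totals, thresholds, n = 3) {
  picked <- character(0)
  out <- list()
  for (th in sort(thresholds, decreasing = TRUE)) {
    # persons above this threshold who have not been drawn yet
    pool <- setdiff(names(totals)[totals > th], picked)
    if (length(pool) < n) stop("not enough persons left above ", th)
    chosen <- sample(pool, n)
    out[[paste0("more_than_", th)]] <- chosen
    picked <- c(picked, chosen)
  }
  out
}

set.seed(1)  # only to make the draw repeatable
samples <- sample_disjoint(totals, thresholds = c(1, 3, 6))
samples
```

Sampling the strictest threshold first matters because its pool is the smallest: with this data only 4 persons have more than 6 points, so drawing the looser thresholds first could leave fewer than 3 of them available.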
