简体   繁体   中英

Sampling by group in R without duplicates between groups

I have a 'data.frame' with two columns ID_1 and ID_2. I want to sample to ID_1, but there must be no duplicates. That is, I want to draw a sample within a group, but the selection within the group must be one that has not been previously selected in other ID_1 groups.

I have already consulted this question: Sampling by Group in R with no replacement but the final result cannot contain any repeats as well probably solves the problem, but when I work with the largest data set, an error occurs. I have tested the suggestion of @LMc.

Example with simulated data.

library(data.table)
library(dplyr)
# Creating artificial data
set.seed(1)
dt <- data.table(ID_2 = sample(10^2,1000,replace = T))   
x <- c(sort(sample(1000-1,200,replace = F)),1000)
x2 <- x-lag(x)
x2[1] <- 1
id_1 <- rep(1:uniqueN(x),x2)
id_1 <- if (length(id_1) == 1000) id_1 else c(id_1,rep(id_1[length(id_1)],1000-length(id_1)))
dt[, ID_1 := id_1]

# testing LMs code 
my_sample <- function(x, ...){
  if (length(x) == 0L) return(NA) else sample(x, ...)
}

dt %>% 
  group_by(ID_1) %>% 
  slice_sample(n = 1) %>% 
  ungroup() %>%
  mutate(resample = duplicated(ID_2)) %>% 
  rowwise() %>%
  mutate(ID_2 = if (resample) my_sample(dt[dt$ID_1 == ID_1 & dt$ID_2 != ID_2, "ID_2"], 1) else ID_2) %>% 
  ungroup() %>% 
  select(-resample)

Visual idea with little data

The suggestion is not clever, because if the value ID_2 assigned to ID_1 has already been assigned to an earlier ID_1, an error occurs. I include a small example where there is only a one-time solution, so you have to run the @LMc code several times until the assignment is correct.

dt_joke <- 
  data.table(ID_2 = c(15L, 60L, 50L, 47L, 60L, 60L, 31L, 31, 22L),
           ID_1 = c(1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L))
dt_joke 
   ID_2 ID_1
1:   15    1
2:   60    2
3:   50    2
4:   47    2
5:   60    3
6:   60    3
7:   31    3
8:   31    4
9:   22    4

Expected result

   ID_2 ID_1
1:   15    1
2:   47    2
3:   60    3
4:   31    4

If you intend to sample ID_2 grouped by ID_1 without duplication of ID_2 between groups, then you're going to run into insufficient data: using this seed, there are 42 groups with single rows here, at least two of which share an ID_2 . This means either some values will be duplicated between groups or some groups will go unrepresented. This compounds, since once you've sampled all the 1-row tables, the 2-row tables are also in a pinch, again likely to be under-sampled.

If you can accept this, then:

head(dt)
#     ID_2  ID_1
#    <int> <int>
# 1:    68     1
# 2:    39     2
# 3:     1     2
# 4:    34     2
# 5:    87     2
# 6:    43     3

set.seed(42)
dtspl <- split(dt, dt$ID_1)
dtspl <- dtspl[order(sapply(dtspl, nrow))]
set.seed(42)
out <- Reduce(function(prev, this)
                rbindlist(list(
                  prev,
                  this[!ID_2 %in% prev$ID_2,][,.SD[sample(nrow(.SD), size=1),]]
                )),
              dtspl)
out
#       ID_2  ID_1
#      <int> <int>
#   1:    68     1
#   2:    54     5
#   3:    42     9
#   4:    40    12
#   5:    89    13
#   6:    90    17
#   7:    51    19
#   8:    37    34
#   9:   100    40
#  10:    82    48
#  ---            
#  91:    69   117
#  92:    61   154
#  93:    83    16
#  94:    15    47
#  95:    16   199
#  96:     7    27
#  97:    39   171
#  98:    22   186
#  99:    76    21
# 100:    32    10

We have all ID_2 covered, but not all ID_1 s.

all(dt$ID_2 %in% out$ID_2)
# [1] TRUE
all(dt$ID_1 %in% out$ID_1)
# [1] FALSE

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM