I have a 'data.frame' with two columns ID_1 and ID_2. I want to sample to ID_1, but there must be no duplicates. That is, I want to draw a sample within a group, but the selection within the group must be one that has not been previously selected in other ID_1 groups.
I have already consulted this question: Sampling by Group in R with no replacement but the final result cannot contain any repeats as well probably solves the problem, but when I work with the largest data set, an error occurs. I have tested the suggestion of @LMc.
Example with simulated data.
library(data.table)
library(dplyr)
# Creating artificial data
set.seed(1)
dt <- data.table(ID_2 = sample(10^2,1000,replace = T))
x <- c(sort(sample(1000-1,200,replace = F)),1000)
x2 <- x-lag(x)
x2[1] <- 1
id_1 <- rep(1:uniqueN(x),x2)
id_1 <- if (length(id_1) == 1000) id_1 else c(id_1,rep(id_1[length(id_1)],1000-length(id_1)))
dt[, ID_1 := id_1]
# testing LMs code
my_sample <- function(x, ...){
if (length(x) == 0L) return(NA) else sample(x, ...)
}
dt %>%
group_by(ID_1) %>%
slice_sample(n = 1) %>%
ungroup() %>%
mutate(resample = duplicated(ID_2)) %>%
rowwise() %>%
mutate(ID_2 = if (resample) my_sample(dt[dt$ID_1 == ID_1 & dt$ID_2 != ID_2, "ID_2"], 1) else ID_2) %>%
ungroup() %>%
select(-resample)
Visual idea with little data
The suggestion is not clever, because if the value ID_2 assigned to ID_1 has already been assigned to an earlier ID_1, an error occurs. I include a small example where there is only a one-time solution, so you have to run the @LMc code several times until the assignment is correct.
dt_joke <-
data.table(ID_2 = c(15L, 60L, 50L, 47L, 60L, 60L, 31L, 31, 22L),
ID_1 = c(1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L))
dt_joke
ID_2 ID_1
1: 15 1
2: 60 2
3: 50 2
4: 47 2
5: 60 3
6: 60 3
7: 31 3
8: 31 4
9: 22 4
Expected result
ID_2 ID_1
1: 15 1
2: 47 2
3: 60 3
4: 31 4
If you intend to sample ID_2
grouped by ID_1
without duplication of ID_2
between groups, then you're going to run into insufficient data: using this seed, there are 42 groups with single rows here, at least two of which share an ID_2
. This means either some values will be duplicated between groups or some groups will go unrepresented. This compounds, since once you've sampled all the 1-row tables, the 2-row tables are also in a pinch, again likely to be under-sampled.
If you can accept this, then:
head(dt)
# ID_2 ID_1
# <int> <int>
# 1: 68 1
# 2: 39 2
# 3: 1 2
# 4: 34 2
# 5: 87 2
# 6: 43 3
set.seed(42)
dtspl <- split(dt, dt$ID_1)
dtspl <- dtspl[order(sapply(dtspl, nrow))]
set.seed(42)
out <- Reduce(function(prev, this)
rbindlist(list(
prev,
this[!ID_2 %in% prev$ID_2,][,.SD[sample(nrow(.SD), size=1),]]
)),
dtspl)
out
# ID_2 ID_1
# <int> <int>
# 1: 68 1
# 2: 54 5
# 3: 42 9
# 4: 40 12
# 5: 89 13
# 6: 90 17
# 7: 51 19
# 8: 37 34
# 9: 100 40
# 10: 82 48
# ---
# 91: 69 117
# 92: 61 154
# 93: 83 16
# 94: 15 47
# 95: 16 199
# 96: 7 27
# 97: 39 171
# 98: 22 186
# 99: 76 21
# 100: 32 10
We have all ID_2
covered, but not all ID_1
s.
all(dt$ID_2 %in% out$ID_2)
# [1] TRUE
all(dt$ID_1 %in% out$ID_1)
# [1] FALSE
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.