[英]Sampling by group in R without duplicates between groups
I have a 'data.frame' with two columns ID_1 and ID_2.我有一个包含两列 ID_1 和 ID_2 的“data.frame”。 I want to sample to ID_1, but there must be no duplicates.我想采样到ID_1,但是不能重复。 That is, I want to draw a sample within a group, but the selection within the group must be one that has not been previously selected in other ID_1 groups.也就是我想在一个组内抽取一个样本,但是组内的选择必须是之前在其他ID_1组中没有选择过的。
I have already consulted this question: Sampling by Group in R with no replacement but the final result cannot contain any repeats as well probably solves the problem, but when I work with the largest data set, an error occurs.我已经咨询过这个问题: Sampling by Group in R with no replacement but the final result cannot contain any repeats as so probably solves the problem,但是当我处理最大的数据集时,会发生错误。 I have tested the suggestion of @LMc.我测试了@LMc的建议。
Example with simulated data.模拟数据示例。
library(data.table)
library(dplyr)
# Creating artificial data
set.seed(1)
dt <- data.table(ID_2 = sample(10^2,1000,replace = T))
x <- c(sort(sample(1000-1,200,replace = F)),1000)
x2 <- x-lag(x)
x2[1] <- 1
id_1 <- rep(1:uniqueN(x),x2)
id_1 <- if (length(id_1) == 1000) id_1 else c(id_1,rep(id_1[length(id_1)],1000-length(id_1)))
dt[, ID_1 := id_1]
# testing LMs code
my_sample <- function(x, ...){
if (length(x) == 0L) return(NA) else sample(x, ...)
}
dt %>%
group_by(ID_1) %>%
slice_sample(n = 1) %>%
ungroup() %>%
mutate(resample = duplicated(ID_2)) %>%
rowwise() %>%
mutate(ID_2 = if (resample) my_sample(dt[dt$ID_1 == ID_1 & dt$ID_2 != ID_2, "ID_2"], 1) else ID_2) %>%
ungroup() %>%
select(-resample)
Visual idea with little data数据少的视觉创意
The suggestion is not clever, because if the value ID_2 assigned to ID_1 has already been assigned to an earlier ID_1, an error occurs.这个建议并不聪明,因为如果赋给 ID_1 的值 ID_2 已经赋给了更早的 ID_1,就会出错。 I include a small example where there is only a one-time solution, so you have to run the @LMc code several times until the assignment is correct.我包括了一个只有一次性解决方案的小示例,因此您必须多次运行 @LMc 代码,直到分配正确。
dt_joke <-
data.table(ID_2 = c(15L, 60L, 50L, 47L, 60L, 60L, 31L, 31, 22L),
ID_1 = c(1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L))
dt_joke
ID_2 ID_1
1: 15 1
2: 60 2
3: 50 2
4: 47 2
5: 60 3
6: 60 3
7: 31 3
8: 31 4
9: 22 4
Expected result预期结果
ID_2 ID_1
1: 15 1
2: 47 2
3: 60 3
4: 31 4
If you intend to sample ID_2
grouped by ID_1
without duplication of ID_2
between groups, then you're going to run into insufficient data: using this seed, there are 42 groups with single rows here, at least two of which share an ID_2
.如果您打算对按ID_1
分组的ID_2
进行采样,而不在组之间重复ID_2
,那么您将遇到不足的数据:使用此种子,这里有 42 个单行组,其中至少两个共享一个ID_2
。 This means either some values will be duplicated between groups or some groups will go unrepresented.这意味着某些值将在组之间重复,或者某些组将 go 未表示。 This compounds, since once you've sampled all the 1-row tables, the 2-row tables are also in a pinch, again likely to be under-sampled.这种化合物,因为一旦您对所有 1 行表进行了采样,2 行表也处于紧要关头,再次可能采样不足。
If you can accept this, then:如果你能接受这一点,那么:
head(dt)
# ID_2 ID_1
# <int> <int>
# 1: 68 1
# 2: 39 2
# 3: 1 2
# 4: 34 2
# 5: 87 2
# 6: 43 3
set.seed(42)
dtspl <- split(dt, dt$ID_1)
dtspl <- dtspl[order(sapply(dtspl, nrow))]
set.seed(42)
out <- Reduce(function(prev, this)
rbindlist(list(
prev,
this[!ID_2 %in% prev$ID_2,][,.SD[sample(nrow(.SD), size=1),]]
)),
dtspl)
out
# ID_2 ID_1
# <int> <int>
# 1: 68 1
# 2: 54 5
# 3: 42 9
# 4: 40 12
# 5: 89 13
# 6: 90 17
# 7: 51 19
# 8: 37 34
# 9: 100 40
# 10: 82 48
# ---
# 91: 69 117
# 92: 61 154
# 93: 83 16
# 94: 15 47
# 95: 16 199
# 96: 7 27
# 97: 39 171
# 98: 22 186
# 99: 76 21
# 100: 32 10
We have all ID_2
covered, but not all ID_1
s.我们涵盖了所有ID_2
,但并未涵盖所有ID_1
。
all(dt$ID_2 %in% out$ID_2)
# [1] TRUE
all(dt$ID_1 %in% out$ID_1)
# [1] FALSE
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.