简体   繁体   English

在 R 中按组抽样,组间不重复

[英]Sampling by group in R without duplicates between groups

I have a 'data.frame' with two columns ID_1 and ID_2.我有一个包含两列 ID_1 和 ID_2 的“data.frame”。 I want to sample to ID_1, but there must be no duplicates.我想采样到ID_1,但是不能重复。 That is, I want to draw a sample within a group, but the selection within the group must be one that has not been previously selected in other ID_1 groups.也就是我想在一个组内抽取一个样本,但是组内的选择必须是之前在其他ID_1组中没有选择过的。

I have already consulted this question: Sampling by Group in R with no replacement but the final result cannot contain any repeats as well probably solves the problem, but when I work with the largest data set, an error occurs.我已经咨询过这个问题: Sampling by Group in R with no replacement but the final result cannot contain any repeats as so probably solves the problem,但是当我处理最大的数据集时,会发生错误。 I have tested the suggestion of @LMc.我测试了@LMc的建议。

Example with simulated data.模拟数据示例。

library(data.table)
library(dplyr)
# Creating artificial data
set.seed(1)
dt <- data.table(ID_2 = sample(10^2,1000,replace = T))   
x <- c(sort(sample(1000-1,200,replace = F)),1000)
x2 <- x-lag(x)
x2[1] <- 1
id_1 <- rep(1:uniqueN(x),x2)
id_1 <- if (length(id_1) == 1000) id_1 else c(id_1,rep(id_1[length(id_1)],1000-length(id_1)))
dt[, ID_1 := id_1]

# testing LMs code 
my_sample <- function(x, ...){
  if (length(x) == 0L) return(NA) else sample(x, ...)
}

dt %>% 
  group_by(ID_1) %>% 
  slice_sample(n = 1) %>% 
  ungroup() %>%
  mutate(resample = duplicated(ID_2)) %>% 
  rowwise() %>%
  mutate(ID_2 = if (resample) my_sample(dt[dt$ID_1 == ID_1 & dt$ID_2 != ID_2, "ID_2"], 1) else ID_2) %>% 
  ungroup() %>% 
  select(-resample)

Visual idea with little data数据少的视觉创意

The suggestion is not clever, because if the value ID_2 assigned to ID_1 has already been assigned to an earlier ID_1, an error occurs.这个建议并不聪明,因为如果赋给 ID_1 的值 ID_2 已经赋给了更早的 ID_1,就会出错。 I include a small example where there is only a one-time solution, so you have to run the @LMc code several times until the assignment is correct.我包括了一个只有一次性解决方案的小示例,因此您必须多次运行 @LMc 代码,直到分配正确。

dt_joke <- 
  data.table(ID_2 = c(15L, 60L, 50L, 47L, 60L, 60L, 31L, 31, 22L),
           ID_1 = c(1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L))
dt_joke 
   ID_2 ID_1
1:   15    1
2:   60    2
3:   50    2
4:   47    2
5:   60    3
6:   60    3
7:   31    3
8:   31    4
9:   22    4

Expected result预期结果

   ID_2 ID_1
1:   15    1
2:   47    2
3:   60    3
4:   31    4

If you intend to sample ID_2 grouped by ID_1 without duplication of ID_2 between groups, then you're going to run into insufficient data: using this seed, there are 42 groups with single rows here, at least two of which share an ID_2 .如果您打算对按ID_1分组的ID_2进行采样,而不在组之间重复ID_2 ,那么您将遇到不足的数据:使用此种子,这里有 42 个单行组,其中至少两个共享一个ID_2 This means either some values will be duplicated between groups or some groups will go unrepresented.这意味着某些值将在组之间重复,或者某些组将 go 未表示。 This compounds, since once you've sampled all the 1-row tables, the 2-row tables are also in a pinch, again likely to be under-sampled.这种化合物,因为一旦您对所有 1 行表进行了采样,2 行表也处于紧要关头,再次可能采样不足。

If you can accept this, then:如果你能接受这一点,那么:

head(dt)
#     ID_2  ID_1
#    <int> <int>
# 1:    68     1
# 2:    39     2
# 3:     1     2
# 4:    34     2
# 5:    87     2
# 6:    43     3

set.seed(42)
dtspl <- split(dt, dt$ID_1)
dtspl <- dtspl[order(sapply(dtspl, nrow))]
set.seed(42)
out <- Reduce(function(prev, this)
                rbindlist(list(
                  prev,
                  this[!ID_2 %in% prev$ID_2,][,.SD[sample(nrow(.SD), size=1),]]
                )),
              dtspl)
out
#       ID_2  ID_1
#      <int> <int>
#   1:    68     1
#   2:    54     5
#   3:    42     9
#   4:    40    12
#   5:    89    13
#   6:    90    17
#   7:    51    19
#   8:    37    34
#   9:   100    40
#  10:    82    48
#  ---            
#  91:    69   117
#  92:    61   154
#  93:    83    16
#  94:    15    47
#  95:    16   199
#  96:     7    27
#  97:    39   171
#  98:    22   186
#  99:    76    21
# 100:    32    10

We have all ID_2 covered, but not all ID_1 s.我们涵盖了所有ID_2 ,但并未涵盖所有ID_1

all(dt$ID_2 %in% out$ID_2)
# [1] TRUE
all(dt$ID_1 %in% out$ID_1)
# [1] FALSE

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM