How to sample a binary output with a fixed number of events (i.e., 1s) in R?

Question

Suppose I have a data frame, for example:

set.seed(123)
df <- data.frame(x=rbinom(100,1,0.9), y=rbinom(100,1,0.95))

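What I want is to sample a subset df_sub from df in which the number of rows with x==1 and y==1 equals 5, regardless of the total number of rows in df_sub, like this: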

## index <- sample(1:nrow(df),..,replace = FALSE)
df_sub <- df[index,]
df_sub
    x y
1   1 1
2   1 1
3   1 1
4   1 0
5   0 1
6   1 1
7   1 1

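As you can see, in df_sub the number of rows with x==1 & y==1 is 5, while the total number of rows is 7. I want to sample the original df so that exactly 5 rows have x==1 & y==1, regardless of the actual number of rows in df_sub.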

Answer 1

We can use rep together with sample: build a vector containing the required number of 1s and 0s, then shuffle it.

n_events <- 20
total_len <- 70
n_zero_events <- total_len - n_events
v1 <- sample(rep(c(1, 0), c(n_events, n_zero_events)))
> sum(v1)
[1] 20
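The snippet above builds a standalone 0/1 vector. To carry the same fixed-count idea over to the rows of df from the question, one possible sketch follows. It assumes the selected rows do not have to be contiguous, and it draws the number of non-event rows at random; both are assumptions, since the question does not specify them. The names event_rows, other_rows, n_other and pick are hypothetical helpers introduced only for illustration.

```r
set.seed(123)
df <- data.frame(x = rbinom(100, 1, 0.9), y = rbinom(100, 1, 0.95))

# Rows where both x and y equal 1 (the "events"), and all remaining rows.
event_rows <- which(df$x == 1 & df$y == 1)
other_rows <- setdiff(seq_len(nrow(df)), event_rows)

n_events <- 5
# Assumption: the number of non-event rows is itself drawn at random (0 to all of them).
n_other <- sample.int(length(other_rows) + 1L, 1L) - 1L

# Draw by position, because sample() treats a single number n as 1:n.
pick <- function(v, k) v[sample.int(length(v), k)]

# Combine both groups and shuffle the row order.
index <- sample(c(pick(event_rows, n_events), pick(other_rows, n_other)))
df_sub <- df[index, ]

# Number of rows in the subset with x == 1 and y == 1.
sum(df_sub$x == 1 & df_sub$y == 1)
```

Because the event rows and the other rows are drawn separately, the event count is fixed by construction while the total number of rows varies from run to run.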

Answer 2

A base R one-liner using sample + rep + replace:

> sample(replace(rep(0, 100), 1:20, 1))
  [1] 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 [75] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0
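The one-liner sets the first 20 entries of a zero vector to 1 and then shuffles the whole vector. A roughly equivalent sketch, shown here only to make the mechanics explicit, picks the 20 positions at random instead of shuffling afterwards. The names n_total, n_events and v2 are illustrative.

```r
n_total <- 100
n_events <- 20

v2 <- integer(n_total)                    # start with all zeros
v2[sample.int(n_total, n_events)] <- 1L   # set n_events distinct random positions to 1

sum(v2)   # always n_events, because sample.int() draws without replacement
```

Both forms give every arrangement of 20 ones among 100 slots the same probability; which one to use is mostly a matter of taste.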

Answer 3

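For the updated question, we can use data.table to perform a self-join that finds all pairs of start and end row indices whose span meets the criterion. Then we sample from those index ranges.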

library(data.table)

subsample <- function(n = 1L, dt, agg) {
  idx <- dt[
    # set the row number and the cumulative count that meet the criterion
    , `:=`(r = .I, z = cumsum(x*y))
  ][
    # set the look-back for the self-join
    , `:=`(z1 = z - agg, z2 = c(0, first(z, -1)))
  ][
    # self-join
    dt, on = .(z2 = z1), nomatch = 0
  ][
    # get the row indices for each row pair that meets the criterion
    , .(idx = .(r:i.r)), seq_along(r)
  ][[2]] # keep just the indices
  # reset dt
  dt[, 3:6 := NULL]
  # check that all the indices are valid (can be deleted, since they always are)
  if (any(vapply(idx, function(x) sum(rowSums(dt[x]) == 2L), integer(1)) != agg)) stop("invalid sample")
  lapply(sample(idx, n, TRUE), function(x) dt[x])
}

set.seed(123)
df <- data.table(x = rbinom(30, 1, 0.8), y = rbinom(30, 1, 0.9))
df_sub <- subsample(10L, df, 5L)
df_sub[[1]]
#>     x y
#>  1: 1 0
#>  2: 1 1
#>  3: 0 1
#>  4: 0 1
#>  5: 1 1
#>  6: 1 1
#>  7: 0 1
#>  8: 1 1
#>  9: 1 1
#> 10: 0 1

set.seed(123)
df <- data.table(x = rbinom(100,1,0.9), y = rbinom(100,1,0.95))
df_sub <- subsample(10L, df, 5L)
df_sub[[1]]
#>    x y
#> 1: 1 1
#> 2: 1 1
#> 3: 0 1
#> 4: 1 1
#> 5: 1 0
#> 6: 1 1
#> 7: 1 1
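The idea behind the self-join can also be sketched in base R, which may make it easier to follow: with a running count z of the rows where x and y are both 1, the block of rows i..j contains z[j] minus the count before row i. This is a minimal illustration under that reading of the answer, not a replacement for the function above. The names z_before, counts, pairs and window are hypothetical, and the outer() call builds an n-by-n matrix, so it only suits small inputs.

```r
set.seed(123)
df <- data.frame(x = rbinom(100, 1, 0.9), y = rbinom(100, 1, 0.95))
agg <- 5L

z <- cumsum(df$x * df$y)        # running count of rows with x == 1 and y == 1
z_before <- c(0, head(z, -1))   # count strictly before each row

# counts[i, j] = number of events in rows i..j (meaningful only when i <= j)
counts <- outer(z_before, z, function(before, upto) upto - before)
pairs <- which(counts == agg & row(counts) <= col(counts), arr.ind = TRUE)

# Pick one (start, end) pair at random and extract that contiguous block.
k <- sample.int(nrow(pairs), 1L)
window <- df[pairs[k, "row"]:pairs[k, "col"], ]

sum(window$x == 1 & window$y == 1)   # equals agg by construction
```

Every candidate block is contiguous, so the total number of rows varies only through the non-event rows that happen to fall inside the chosen range.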
