簡體   English   中英

如何在 R 中對具有固定事件數(即 1)的二進制 output 進行采樣?

[英]How to sample a binary output with a fixed number of events (i.e. 1) in R?

假設我有一個數據框,例如:

set.seed(123)
df <- data.frame(x=rbinom(100,1,0.9), y=rbinom(100,1,0.95))

我想要的是從df中采樣一個子集df_sub ,其中x==1y==1的行數等於 5,而不管df_sub的總行數如下:

## index <- sample(1:nrow(df),..,replace = FALSE)
df_sub <- df[index,]
df_sub
    x y
1   1 1
2   1 1
3   1 1
4   1 0
5   0 1
6   1 1
7   1 1

如您所見,在 df_sub 中, x==1 & y==1的行數等於5而總行數等於7 我想用x==1 & y==1以固定數量5對原始df進行采樣,而不管df_sub的實際行數。

我們可以將repsample一起使用

n_events <- 20
total_len <- 70
n_zero_events <- total_len - n_events
v1 <- sample(rep(c(1, 0), c(n_events, n_zero_events)))
> sum(v1)
[1] 20

A base R 單線使用sample + rep + replace

> sample(replace(rep(0, 100), 1:20, 1))
  [1] 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
 [38] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
 [75] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0

對於更新后的問題,我們可以使用data.table執行自連接以查找所有符合標准的往返索引。 然后我們從這些指數中取樣。

library(data.table)

subsample <- function(n = 1L, dt, agg) {
  idx <- dt[
    # set the row number and the cumulative count that meet the criterion
    , `:=`(r = .I, z = cumsum(x*y))
  ][
    # set the look-back for the self-join
    , `:=`(z1 = z - agg, z2 = c(0, first(z, -1)))
  ][
    # self-join
    dt, on = .(z2 = z1), nomatch = 0
  ][
    # get the row indices for each row pair that meets the criterion
    , .(idx = .(r:i.r)), seq_along(r)
  ][[2]] # keep just the indices
  # reset dt
  dt[, 3:6 := NULL]
  # check that all the indices are valid (can be deleted, since they always are)
  if (any(vapply(idx, function(x) sum(rowSums(dt[x]) == 2L), integer(1)) != agg)) stop("invalid sample")
  lapply(sample(idx, n, TRUE), function(x) dt[x])
}

set.seed(123)
df <- data.table(x = rbinom(30, 1, 0.8), y = rbinom(30, 1, 0.9))
df_sub <- subsample(10L, df, 5L)
df_sub[[1]]
#>     x y
#>  1: 1 0
#>  2: 1 1
#>  3: 0 1
#>  4: 0 1
#>  5: 1 1
#>  6: 1 1
#>  7: 0 1
#>  8: 1 1
#>  9: 1 1
#> 10: 0 1

set.seed(123)
df <- data.table(x = rbinom(100,1,0.9), y = rbinom(100,1,0.95))
df_sub <- subsample(10L, df, 5L)
df_sub[[1]]
#>    x y
#> 1: 1 1
#> 2: 1 1
#> 3: 0 1
#> 4: 1 1
#> 5: 1 0
#> 6: 1 1
#> 7: 1 1

暫無
暫無

聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.

 
粵ICP備18138465號  © 2020-2024 STACKOOM.COM