![](/img/trans.png)
[英]How to output values of R variables in an inline LateX equation in R Markdown (i.e. dynamically updated)
[英]How to sample a binary output with a fixed number of events (i.e. 1) in R?
假設我有一個數據框,例如:
set.seed(123)
df <- data.frame(x=rbinom(100,1,0.9), y=rbinom(100,1,0.95))
我想要的是從df
中采樣一個子集df_sub
,其中x==1
和y==1
的行數等於 5,而不管df_sub
的總行數如下:
## index <- sample(1:nrow(df),..,replace = FALSE)
df_sub <- df[index,]
df_sub
x y
1 1 1
2 1 1
3 1 1
4 1 0
5 0 1
6 1 1
7 1 1
如您所見,在 df_sub 中, x==1
& y==1
的行數等於5
而總行數等於7
。 我想用x==1
& y==1
以固定數量5
對原始df
進行采樣,而不管df_sub
的實際行數。
我們可以將rep
與sample
一起使用
n_events <- 20
total_len <- 70
n_zero_events <- total_len - n_events
v1 <- sample(rep(c(1, 0), c(n_events, n_zero_events)))
> sum(v1)
[1] 20
A base R 單線使用sample
+ rep
+ replace
> sample(replace(rep(0, 100), 1:20, 1))
[1] 0 1 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
[38] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0
[75] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 0
對於更新后的問題,我們可以使用data.table
執行自連接以查找所有符合標准的往返索引。 然后我們從這些指數中取樣。
library(data.table)
subsample <- function(n = 1L, dt, agg) {
idx <- dt[
# set the row number and the cumulative count that meet the criterion
, `:=`(r = .I, z = cumsum(x*y))
][
# set the look-back for the self-join
, `:=`(z1 = z - agg, z2 = c(0, first(z, -1)))
][
# self-join
dt, on = .(z2 = z1), nomatch = 0
][
# get the row indices for each row pair that meets the criterion
, .(idx = .(r:i.r)), seq_along(r)
][[2]] # keep just the indices
# reset dt
dt[, 3:6 := NULL]
# check that all the indices are valid (can be deleted, since they always are)
if (any(vapply(idx, function(x) sum(rowSums(dt[x]) == 2L), integer(1)) != agg)) stop("invalid sample")
lapply(sample(idx, n, TRUE), function(x) dt[x])
}
set.seed(123)
df <- data.table(x = rbinom(30, 1, 0.8), y = rbinom(30, 1, 0.9))
df_sub <- subsample(10L, df, 5L)
df_sub[[1]]
#> x y
#> 1: 1 0
#> 2: 1 1
#> 3: 0 1
#> 4: 0 1
#> 5: 1 1
#> 6: 1 1
#> 7: 0 1
#> 8: 1 1
#> 9: 1 1
#> 10: 0 1
set.seed(123)
df <- data.table(x = rbinom(100,1,0.9), y = rbinom(100,1,0.95))
df_sub <- subsample(10L, df, 5L)
df_sub[[1]]
#> x y
#> 1: 1 1
#> 2: 1 1
#> 3: 0 1
#> 4: 1 1
#> 5: 1 0
#> 6: 1 1
#> 7: 1 1
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.