R: generate a column of random 1s and 0s with restrictions

Question

I have a data set with 500 observations. I would like to generate 1s and 0s randomly under two scenarios.

Current Dataset

  Id     Age    Category   
  1      23     1
  2      24     1
  3      21     2
  .      .      .
  .      .      .
  .      .      .
500      27     3

Scenario 1

The total number of 1s should be 200, and they should be placed at random. The remaining 300 should be 0s.

Scenario 2

The total number of 1s should be 200. The remaining 300 should be 0s.
- 40% of the 1s should be in Category1. That is, 80 1s should be in Category1.
- 40% of the 1s should be in Category2. That is, 80 1s should be in Category2.
- 20% of the 1s should be in Category3. That is, 40 1s should be in Category3.

Expected Output

  Id     Age    Category  Indicator  
  1      23     1         1
  2      24     1         0
  3      21     2         1
  .      .      .
  .      .      .
  .      .      .
500      27     3         1

I know the function sample(c(0,1), 500) will generate 1s, but I don't know how to make it generate exactly 200 1s at random. I am also not sure how to generate 80 1s randomly in Category1, 80 1s in Category2 and 40 1s in Category3.

Answer 1

Here's a full worked example.

Let's say your data looked like this:

set.seed(69)

df <- data.frame(id = 1:500, 
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

head(df)
#>   id Age Category
#> 1  1  21        2
#> 2  2  22        2
#> 3  3  28        3
#> 4  4  27        2
#> 5  5  27        1
#> 6  6  26        2

Now, you didn't mention how many rows of each category you have, so let's check how many there are in our sample:

table(df$Category)

#>   1   2   3 
#> 153 179 168

Scenario 1 is straightforward. Create a vector of 500 zeros, then write a 1 at a random sample of 200 of its indices:

df$label <- numeric(nrow(df))
df$label[sample(nrow(df), 200)] <- 1

head(df)
#>   id Age Category label
#> 1  1  21        2     1
#> 2  2  22        2     1
#> 3  3  28        3     0
#> 4  4  27        2     0
#> 5  5  27        1     0
#> 6  6  26        2     1

So we have random zeros and ones, and when we count them we get exactly 300 zeros and 200 ones:

table(df$label)
#> 
#>   0   1 
#> 300 200

Scenario 2 is similar but a bit more involved, because we need to perform the same operation groupwise by category:

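# Reset the label column before assigning ones per category
df$label <- numeric(nrow(df))
df <- do.call("rbind", lapply(split(df, df$Category), function(d) {
  # Share of ones: 0.4 for categories 1 and 2, 0.2 for category 3
  # (3 %/% 3 is 1, so the 0.4 is halved only for category 3)
  n_ones <- round(nrow(d) * 0.4 / ((d$Category[1] %/% 3) + 1))
  d$label[sample(nrow(d), n_ones)] <- 1
  d
}))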

head(df)
#>      id Age Category label
#> 1.5   5  27        1     0
#> 1.10 10  24        1     0
#> 1.13 13  23        1     1
#> 1.19 19  24        1     0
#> 1.26 26  22        1     1
#> 1.27 27  24        1     1

Now, since the number of rows in each category is not nicely divisible by 10, we cannot get exactly 40% and 20% (though you might with your own data), but we get as close as possible, as the following demonstrates. Note that this approach fixes the share of ones within each category rather than the counts of 80, 80 and 40 ones requested in the question, so the total number of ones here is 61 + 72 + 34 = 167, not 200:

label_table <- table(df$Category, df$label)
label_table   
#>       0   1
#>   1  92  61
#>   2 107  72
#>   3 134  34

apply(label_table, 1, function(x) x[2]/sum(x))
#>         1         2         3 
#> 0.3986928 0.4022346 0.2023810

Created on 2020-08-12 by the reprex package (v0.3.0)

Answer 2

Another way to fill random values is to create a vector of possible values (80 values of 1, and nrow - 80 values of 0) and then sample from those possible values. This can use a bit more memory than setting values by indexing, but a vector of potential values is so small that the difference is generally trivial.

set.seed(42)

df <- data.frame(id = 1:500, 
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

## In Tidyverse

library(tidyverse)

set.seed(42)

df2 <- df %>%
  group_by(Category) %>%
  mutate(Label = case_when(
    Category == 1 ~ sample(
      c(rep(1,80),rep(0,n()-80)),
      n()
    ),
    Category == 2 ~ sample(
      c(rep(1,80),rep(0,n()-80)), 
      n()
    ),
    Category == 3 ~ sample(
      c(rep(1,40),rep(0,n()-40)), 
      n()
    )
  ))

table(df2$Category,df2$Label)

#     0   1
# 1  93  80
# 2  82  80
# 3 125  40

## In base

df3 <- df

df3[df$Category == 1,"Label"] <- sample(
  c(rep(1,80),rep(0,nrow(df[df$Category == 1,])-80)),
  nrow(df[df$Category == 1,])
)
df3[df$Category == 2,"Label"] <- sample(
  c(rep(1,80),rep(0,nrow(df[df$Category == 2,])-80)),
  nrow(df[df$Category == 2,])
)
df3[df$Category == 3,"Label"] <- sample(
  c(rep(1,40),rep(0,nrow(df[df$Category == 3,])-40)),
  nrow(df[df$Category == 3,])
)

table(df3$Category,df3$Label)

#     0   1
# 1  93  80
# 2  82  80
# 3 125  40
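The three near-identical blocks above could probably be collapsed into one call. The following is only a sketch in base R, not part of the original answer: `ones_per_category` is a hypothetical lookup vector holding the counts from the question, and `ave()` is used to apply the same "build the pool, then shuffle it" step within each category while keeping the original row order.

```r
set.seed(42)

# Same simulated data as in the answer above
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

# Hypothetical lookup: number of 1s wanted in each category
ones_per_category <- c("1" = 80, "2" = 80, "3" = 40)

# Each category needs at least as many rows as requested 1s
stopifnot(all(table(df$Category)[names(ones_per_category)] >= ones_per_category))

# ave() hands FUN the Category values of one group at a time and writes
# the returned vector back into the rows that group occupies
df$Label <- ave(df$Category, df$Category, FUN = function(x) {
  k <- ones_per_category[[as.character(x[1])]]
  # Pool of k ones and (group size - k) zeros, returned in random order
  sample(rep(c(1, 0), times = c(k, length(x) - k)))
})

table(df$Category, df$Label)
```

Adding a fourth category would then only mean adding one entry to the lookup vector.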

Answer 3

To solve scenario 1, create a vector with 300 zeroes and 200 ones and then sample from it without replacement.

pull_from = c(rep(0,300), rep(1,200))

sample(pull_from, replace = FALSE)
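To attach the result to the data, the shuffled vector can be assigned as a new column. This is a minimal sketch; the one-column `df` is a hypothetical stand-in for the asker's 500-row data set, and `Indicator` is the column name used in the question's expected output.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the asker's 500-row data set
df <- data.frame(Id = 1:500)

# rep() with a `times` vector builds 300 zeros followed by 200 ones
pull_from <- rep(c(0, 1), times = c(300, 200))

# sample() without a size argument returns a random permutation of its
# input, so the counts of zeros and ones are preserved exactly
df$Indicator <- sample(pull_from)

table(df$Indicator)
```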

For scenario 2, I suggest breaking your data into 3 separate chunks based on category, repeating the above step with different values for the numbers of zeroes and ones you need, and then recombining the chunks into one data frame.
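One way this split, label and recombine idea might look in base R is sketched below. The data frame is simulated, and the names `n_ones` and `chunks` are hypothetical; `unsplit()` is used for the recombining step so that the rows come back in their original order.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the asker's data
df <- data.frame(Id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

# Number of ones wanted in each category (from the question)
n_ones <- c("1" = 80, "2" = 80, "3" = 40)

# Break the data into one chunk per category
chunks <- split(df, df$Category)

# Repeat the scenario 1 step inside each chunk with its own counts
chunks <- Map(function(chunk, k) {
  stopifnot(nrow(chunk) >= k)  # cannot place more ones than rows
  chunk$Indicator <- sample(rep(c(0, 1), times = c(nrow(chunk) - k, k)))
  chunk
}, chunks, n_ones[names(chunks)])

# Recombine; unsplit() restores the original row order
df <- unsplit(chunks, df$Category)

table(df$Category, df$Indicator)
```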
