
R: generate a column of random 1s and 0s with restrictions

I have a data set with 500 observations. I would like to generate 1s and 0s randomly under two scenarios.

Current Dataset

  Id     Age    Category   
  1      23     1
  2      24     1
  3      21     2
  .      .      .
  .      .      .
  .      .      .
500      27     3

Scenario 1

  • The total number of 1s should be 200 and they should be random. The remaining 300 should be 0s.

Scenario 2

  • The total number of 1s should be 200. The remaining 300 should be 0s.
    • 40% of the 1s should be in Category1. That is, 80 1s should be in Category1.
    • 40% of the 1s should be in Category2. That is, 80 1s should be in Category2.
    • 20% of the 1s should be in Category3. That is, 40 1s should be in Category3.

Expected Output

  Id     Age    Category  Indicator  
  1      23     1         1
  2      24     1         0
  3      21     2         1
  .      .      .
  .      .      .
  .      .      .
500      27     3         1

I know that sample(c(0,1), 500, replace = TRUE) will generate random 0s and 1s, but I don't know how to make it generate exactly 200 1s. I am also not sure how to generate 80 1s randomly in Category1, 80 1s in Category2 and 40 1s in Category3.

Here's a full worked example.

Let's say your data looked like this:

set.seed(69)

df <- data.frame(id = 1:500, 
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

head(df)
#>   id Age Category
#> 1  1  21        2
#> 2  2  22        2
#> 3  3  28        3
#> 4  4  27        2
#> 5  5  27        1
#> 6  6  26        2

Now, you didn't mention how many of each category you had, so let's check how many there are in our sample:

table(df$Category)

#>   1   2   3 
#> 153 179 168 

Scenario 1 is straightforward. You need to create a vector of 500 zeros, then write a one at 200 randomly sampled indexes of your new vector:

df$label <- numeric(nrow(df))
df$label[sample(nrow(df), 200)] <- 1

head(df)
#>   id Age Category label
#> 1  1  21        2     1
#> 2  2  22        2     1
#> 3  3  28        3     0
#> 4  4  27        2     0
#> 5  5  27        1     0
#> 6  6  26        2     1

So we have random zeros and ones, and when we count them we get exactly 200 ones:

table(df$label)
#> 
#>   0   1 
#> 300 200
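
The same idea can be wrapped in a small helper so that the number of ones becomes a parameter. This is only a sketch: the name random_indicator is made up for illustration and is not part of base R.

```r
# Hypothetical helper: a 0/1 vector of length n with exactly n_ones ones
# placed at randomly chosen positions.
random_indicator <- function(n, n_ones) {
  stopifnot(n_ones >= 0, n_ones <= n)
  out <- numeric(n)                    # start with all zeros
  out[sample.int(n, n_ones)] <- 1      # n_ones distinct positions, no replacement
  out
}

set.seed(1)
label <- random_indicator(500, 200)
table(label)
```

Because the positions are drawn without replacement, the count of ones is fixed and only their placement is random.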

Scenario 2 is similar but a bit more involved, because we need to perform a similar operation groupwise by category:

df$label <- numeric(nrow(df))
df <- do.call("rbind", lapply(split(df, df$Category), function(d) {
  n_ones <- round(nrow(d) * 0.4 / ((d$Category[1] %/% 3) + 1))
  d$label[sample(nrow(d), n_ones)] <- 1 
  d
  }))

head(df)
#>      id Age Category label
#> 1.5   5  27        1     0
#> 1.10 10  24        1     0
#> 1.13 13  23        1     1
#> 1.19 19  24        1     0
#> 1.26 26  22        1     1
#> 1.27 27  24        1     1

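Note that this code labels 40% of the rows in categories 1 and 2 and 20% of the rows in category 3, which is a different reading from asking for 80, 80 and 40 ones; with this sample it yields 167 ones in total rather than 200. Since the number of rows in each category is not a multiple of 5, we cannot hit exactly 40% and 20% (though you might with your own data), but we get as close as possible, as the following demonstrates: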

label_table <- table(df$Category, df$label)
label_table   
#>       0   1
#>   1  92  61
#>   2 107  72
#>   3 134  34

apply(label_table, 1, function(x) x[2]/sum(x))
#>         1         2         3 
#> 0.3986928 0.4022346 0.2023810
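
If the percentages are instead read as shares of the 200 ones (80, 80 and 40, as in the question), the per-category counts can be computed up front and sampled by row position. A minimal sketch, assuming every category has at least as many rows as the ones requested; total_ones, shares and targets are names introduced here for illustration.

```r
set.seed(69)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

total_ones <- 200
shares  <- c("1" = 0.4, "2" = 0.4, "3" = 0.2)  # share of the ones per category
targets <- round(total_ones * shares)           # 80, 80, 40

# Row positions belonging to each category
rows_by_cat <- split(seq_len(nrow(df)), df$Category)

# Guard: a category cannot receive more ones than it has rows
stopifnot(all(lengths(rows_by_cat)[names(targets)] >= targets))

df$label <- 0
for (k in names(targets)) {
  rows <- rows_by_cat[[k]]
  # Sample positions within rows, so a one-row category is handled correctly
  df$label[rows[sample.int(length(rows), targets[[k]])]] <- 1
}

table(df$Category, df$label)
```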

Created on 2020-08-12 by the reprex package (v0.3.0)

Another way to fill random values is to build a vector holding exactly the values you want (for a category with n rows, 80 ones and n - 80 zeros) and then shuffle it with sample(). This can use a bit more memory than setting values by indexing, but the vector is so small that the cost is generally trivial.

set.seed(42)

df <- data.frame(id = 1:500, 
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

## In Tidyverse

library(tidyverse)

set.seed(42)

df2 <- df %>%
  group_by(Category) %>%
  mutate(Label = case_when(
    Category == 1 ~ sample(
      c(rep(1,80),rep(0,n()-80)),
      n()
    ),
    Category == 2 ~ sample(
      c(rep(1,80),rep(0,n()-80)), 
      n()
    ),
    Category == 3 ~ sample(
      c(rep(1,40),rep(0,n()-40)), 
      n()
    )
  ))

table(df2$Category,df2$Label)

#     0   1
# 1  93  80
# 2  82  80
# 3 125  40

## In base

df3 <- df

df3[df$Category == 1,"Label"] <- sample(
  c(rep(1,80),rep(0,nrow(df[df$Category == 1,])-80)),
  nrow(df[df$Category == 1,])
)
df3[df$Category == 2,"Label"] <- sample(
  c(rep(1,80),rep(0,nrow(df[df$Category == 2,])-80)),
  nrow(df[df$Category == 2,])
)
df3[df$Category == 3,"Label"] <- sample(
  c(rep(1,40),rep(0,nrow(df[df$Category == 3,])-40)),
  nrow(df[df$Category == 3,])
)

table(df3$Category,df3$Label)

#     0   1
# 1  93  80
# 2  82  80
# 3 125  40
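
The three near-identical blocks above can be collapsed by keeping the wanted counts in a lookup vector. The following is a sketch rather than a drop-in replacement: n_ones is a name introduced here, and it assumes each category has more than one row and at least as many rows as ones requested.

```r
set.seed(42)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

# Number of ones wanted in each category, keyed by category value
n_ones <- c("1" = 80, "2" = 80, "3" = 40)

# ave() applies the function to each group's Category values and writes the
# result back in the original row order.
df$Label <- ave(df$Category, df$Category, FUN = function(g) {
  k <- n_ones[[as.character(g[1])]]
  sample(rep(c(1, 0), times = c(k, length(g) - k)))  # shuffle k ones among zeros
})

table(df$Category, df$Label)
```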

To solve scenario 1, you'll need to create a vector with 300 zeroes and 200 ones and then sample from that without replacement, which simply shuffles it.

pull_from = c(rep(0,300), rep(1,200))

sample(pull_from, replace = FALSE)

For scenario 2, I suggest breaking your data into 3 separate chunks based on category, repeating the above step with different values for the numbers of zeroes and ones you need and then recombining into one dataframe.
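
The split, label and recombine steps could look roughly like this in base R. It is a sketch on made-up data, assuming each category has at least as many rows as the ones requested; the column name Indicator follows the question's expected output, and n_ones is a name introduced here.

```r
set.seed(1)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

n_ones <- c(80, 80, 40)   # ones wanted in categories 1, 2 and 3, in that order

chunks <- split(df, df$Category)   # one data frame per category
chunks <- Map(function(d, k) {
  pull_from <- c(rep(1, k), rep(0, nrow(d) - k))
  d$Indicator <- sample(pull_from, replace = FALSE)  # shuffle within the chunk
  d
}, chunks, n_ones)

# unsplit() puts the rows back in their original order
result <- unsplit(chunks, df$Category)

table(result$Category, result$Indicator)
```

Using unsplit() rather than rbind() keeps the rows in their original order, which matters if the data frame is later lined up against other vectors by position.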
