I have a data set with 500 observations. I would like to generate 1s and 0s randomly based on two scenarios:
Current Dataset
Id Age Category
1 23 1
2 24 1
3 21 2
. . .
. . .
. . .
500 27 3
Scenario 1
Scenario 2
Expected Output
Id Age Category Indicator
1 23 1 1
2 24 1 0
3 21 2 1
. . .
. . .
. . .
500 27 3 1
I know that sample(c(0,1), 500, replace = TRUE) will generate random 0s and 1s, but I don't know how to make it generate exactly 200 1s at random. I'm also not sure how to generate 80 1s at random in Category 1, 80 1s in Category 2 and 40 1s in Category 3.
Here's a full worked example.
Let's say your data looked like this:
set.seed(69)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))
head(df)
#> id Age Category
#> 1 1 21 2
#> 2 2 22 2
#> 3 3 28 3
#> 4 4 27 2
#> 5 5 27 1
#> 6 6 26 2
Now, you didn't mention how many of each category you had, so let's check how many there are in our sample:
table(df$Category)
#> 1 2 3
#> 153 179 168
Scenario 1 is straightforward. You need to create a vector of 500 zeros, then write a one at 200 randomly sampled indexes of your new vector:
df$label <- numeric(nrow(df))
df$label[sample(nrow(df), 200)] <- 1
head(df)
#> id Age Category label
#> 1 1 21 2 1
#> 2 2 22 2 1
#> 3 3 28 3 0
#> 4 4 27 2 0
#> 5 5 27 1 0
#> 6 6 26 2 1
So we have randomly placed zeros and ones, and when we count them we get exactly 200 ones:
table(df$label)
#>
#> 0 1
#> 300 200
Scenario 2 is similar but a bit more involved, because we need to perform a similar operation groupwise by category:
df$label <- numeric(nrow(df))
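df <- do.call("rbind", lapply(split(df, df$Category), function(d) {
  # Category %/% 3 is 0 for categories 1 and 2 and 1 for category 3,
  # so the target share is 40% for categories 1 and 2 and 20% for category 3
  n_ones <- round(nrow(d) * 0.4 / ((d$Category[1] %/% 3) + 1))
  d$label[sample(nrow(d), n_ones)] <- 1
  d
}))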
head(df)
#> id Age Category label
#> 1.5 5 27 1 0
#> 1.10 10 24 1 0
#> 1.13 13 23 1 1
#> 1.19 19 24 1 0
#> 1.26 26 22 1 1
#> 1.27 27 24 1 1
Now, since 40% or 20% of each category's size is not a whole number here, we cannot get exactly 40% and 20% (though you might with your own data), but rounding gets us as close as possible, as the following demonstrates:
label_table <- table(df$Category, df$label)
label_table
#> 0 1
#> 1 92 61
#> 2 107 72
#> 3 134 34
apply(label_table, 1, function(x) x[2]/sum(x))
#> 1 2 3
#> 0.3986928 0.4022346 0.2023810
Created on 2020-08-12 by the reprex package (v0.3.0)
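The `%/%` expression used for Scenario 2 above is compact but only works for this particular 40/40/20 split. As a minimal sketch (not part of the original reprex), the same shares can be kept in a named lookup vector and applied with `ave()`, which also leaves the rows in their original order. The name `props` is an assumption made for illustration.

```r
set.seed(69)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

# Hypothetical lookup: share of ones wanted in each category
props <- c("1" = 0.4, "2" = 0.4, "3" = 0.2)

# ave() calls FUN once per category and writes the results back in row order
df$label <- ave(df$Category, df$Category, FUN = function(x) {
  n_ones <- round(length(x) * props[[as.character(x[1])]])
  pool <- rep(c(1, 0), times = c(n_ones, length(x) - n_ones))
  pool[sample.int(length(pool))]  # shuffle the pool of ones and zeros
})

table(df$Category, df$label)
```

Changing the shares then only means editing `props`, and any category value missing from `props` fails loudly instead of silently getting a default.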
Another way to fill random values is to create a vector of possible values (80 values of 1, and nrow-80 values of 0) and then sample from those possible values. This can use a bit more memory than setting values by indexing, but a vector of potential values is so small that it is generally trivial.
set.seed(42)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))
## In Tidyverse
library(tidyverse)
set.seed(42)
df2 <- df %>%
  group_by(Category) %>%
  mutate(Label = case_when(
    Category == 1 ~ sample(
      c(rep(1, 80), rep(0, n() - 80)),
      n()
    ),
    Category == 2 ~ sample(
      c(rep(1, 80), rep(0, n() - 80)),
      n()
    ),
    Category == 3 ~ sample(
      c(rep(1, 40), rep(0, n() - 40)),
      n()
    )
  ))
table(df2$Category,df2$Label)
# 0 1
# 1 93 80
# 2 82 80
# 3 125 40
## In base
df3 <- df
df3[df$Category == 1, "Label"] <- sample(
  c(rep(1, 80), rep(0, nrow(df[df$Category == 1, ]) - 80)),
  nrow(df[df$Category == 1, ])
)
df3[df$Category == 2, "Label"] <- sample(
  c(rep(1, 80), rep(0, nrow(df[df$Category == 2, ]) - 80)),
  nrow(df[df$Category == 2, ])
)
df3[df$Category == 3, "Label"] <- sample(
  c(rep(1, 40), rep(0, nrow(df[df$Category == 3, ]) - 40)),
  nrow(df[df$Category == 3, ])
)
table(df3$Category,df3$Label)
# 0 1
# 1 93 80
# 2 82 80
# 3 125 40
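The three base R blocks above differ only in the category and the number of ones, so they could be collapsed into a loop over a named vector of counts. This is a sketch under the assumption that every category has at least as many rows as the ones requested; `n_ones` and `df4` are names introduced here for illustration.

```r
set.seed(42)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

# Ones wanted per category; the names must match the Category values
n_ones <- c("1" = 80, "2" = 80, "3" = 40)

df4 <- df
df4$Label <- 0
for (k in names(n_ones)) {
  rows <- which(df4$Category == as.integer(k))
  # Guard: we cannot place more ones than there are rows in the category
  stopifnot(length(rows) >= n_ones[[k]])
  pool <- rep(c(1, 0), times = c(n_ones[[k]], length(rows) - n_ones[[k]]))
  df4$Label[rows] <- pool[sample.int(length(pool))]  # shuffled pool
}

table(df4$Category, df4$Label)
```

Because the labels are drawn in a different order than in the blocks above, the individual rows that receive a 1 will generally differ, but the per-category totals are fixed by `n_ones`.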
To solve scenario 1, you'll need to create a vector with 300 zeroes and 200 ones and then sample from that without replacement.
pull_from = c(rep(0,300), rep(1,200))
sample(pull_from, replace = FALSE)
For scenario 2, I suggest breaking your data into 3 separate chunks based on category, repeating the above step with different values for the numbers of zeroes and ones you need and then recombining into one dataframe.
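The split, label and recombine steps described above might look roughly like the following. The data frame is made-up example data, the column names `Indicator1` and `Indicator2` are assumptions, and the counts of 80, 80 and 40 are taken from the question.

```r
set.seed(1)
df <- data.frame(id = 1:500,
                 Age = 20 + sample(10, 500, TRUE),
                 Category = sample(3, 500, TRUE))

# Scenario 1: shuffle 300 zeroes and 200 ones into a new column
pull_from <- c(rep(0, 300), rep(1, 200))
df$Indicator1 <- sample(pull_from, replace = FALSE)

# Scenario 2: ones wanted per category (names match the Category values)
n_ones <- c("1" = 80, "2" = 80, "3" = 40)

chunks <- split(df, df$Category)          # one data frame per category
chunks <- lapply(chunks, function(d) {
  k <- n_ones[[as.character(d$Category[1])]]
  stopifnot(nrow(d) >= k)                 # need at least k rows in the chunk
  pull_from <- c(rep(0, nrow(d) - k), rep(1, k))
  d$Indicator2 <- sample(pull_from, replace = FALSE)
  d
})
df <- unsplit(chunks, df$Category)        # recombine in the original row order

table(df$Category, df$Indicator2)
```

`unsplit()` is used here rather than `rbind()` so that the rows come back in their original order; `do.call(rbind, chunks)` would also recombine them, sorted by category.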