In the R package caret, can we create stratified training and test sets based on several variables using the function createDataPartition() (or createFolds() for cross-validation)?
Here is an example for one variable:
#2/3rds for training
library(caret)
inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain=df[inTrain,]
dfTest=df[-inTrain,]
In the code above, the training and test sets are stratified by 'df$yourFactor'. But is it possible to stratify using several variables (e.g. 'df$yourFactor' and 'df$yourFactor2')? The following code seems to work, but I don't know if it is correct:
inTrain = createDataPartition(df$yourFactor, df$yourFactor2, p = 2/3, list = FALSE)
This is fairly simple if you use the tidyverse. For example:
library(dplyr)

df <- df %>%
  mutate(n = row_number()) %>%  # create a row number if you don't have one
  select(n, everything())       # put 'n' at the front of the dataset
train <- df %>%
  group_by(var1, var2) %>%      # any number of variables you wish to partition by proportionally
  sample_frac(.7) %>%           # '.7' is the proportion of each group you wish to sample
  ungroup()
test <- anti_join(df, train, by = "n")  # test data frame: the observations not in 'train'
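To check that the split above really is proportional, one option is to compare the share of each var1/var2 combination in the full data with its share in the training set. The following is only a sketch on made-up data: the toy data frame and its values are assumptions for illustration, and var1 and var2 stand in for your own stratification variables.

```r
library(dplyr)

set.seed(42)
# Hypothetical toy data; var1 and var2 stand in for your stratification variables
df <- data.frame(
  n    = 1:1000,
  var1 = sample(c("a", "b", "c"), 1000, replace = TRUE),
  var2 = sample(c("x", "y"), 1000, replace = TRUE)
)

# Same split as above: 70% of every var1/var2 combination goes to training
train <- df %>%
  group_by(var1, var2) %>%
  sample_frac(0.7) %>%
  ungroup()
test <- anti_join(df, train, by = "n")

# Share of each var1/var2 combination in the full data and in the training set
full_share  <- prop.table(table(df$var1, df$var2))
train_share <- prop.table(table(train$var1, train$var2))

# Differences should be close to zero if the stratification worked
round(full_share - train_share, 3)
```

Small non-zero differences are expected, because the 70% sample size within each group has to be rounded to a whole number of rows.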
There is a better way to do this.
set.seed(1)
n <- 1e4
d <- data.frame(yourFactor  = sample(1:5, n, TRUE),
                yourFactor2 = rbinom(n, 1, .5),
                yourFactor3 = rbinom(n, 1, .7))
# one level per combination of yourFactor and yourFactor2
d$group <- interaction(d[, c('yourFactor', 'yourFactor2')])
# draw 30 row indices within each group
indices <- tapply(1:nrow(d), d$group, sample, 30)
subsampd <- d[unlist(indices, use.names = FALSE), ]
This draws a random sample of 30 rows from every combination of yourFactor and yourFactor2. Note that this is a fixed-size sample per stratum rather than a proportional split.
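The same interaction() idea can be combined with caret itself. As far as I can tell from its signature, the second positional argument of createDataPartition() is times (the number of partitions), so the call in the question would not stratify on df$yourFactor2. A minimal sketch, assuming both variables are categorical and every combination has enough rows to split (the toy data frame below is made up for illustration):

```r
library(caret)

set.seed(1)
# Hypothetical toy data with two stratification factors
df <- data.frame(
  yourFactor  = factor(sample(1:5, 1000, replace = TRUE)),
  yourFactor2 = factor(rbinom(1000, 1, 0.5))
)

# One factor with a level per observed combination of the two variables
strata <- interaction(df$yourFactor, df$yourFactor2, drop = TRUE)

# Stratify on the combined factor: roughly 2/3 of each combination for training
inTrain <- createDataPartition(strata, p = 2/3, list = FALSE)
dfTrain <- df[inTrain, ]
dfTest  <- df[-inTrain, ]

# The same combined factor can be passed to createFolds() for cross-validation
folds <- createFolds(strata, k = 5)
```

With many variables or rare combinations, some strata may contain only a handful of rows, in which case the split within those strata can be uneven.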