Caret - creating stratified data sets based on several variables

Question

In the R package caret, can we create stratified training and test sets based on several variables using the function createDataPartition() (or createFolds() for cross-validation)?

Here is an example for one variable:

#2/3rds for training
library(caret)
inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)
dfTrain=df[inTrain,]
dfTest=df[-inTrain,]

In the code above, the training and test sets are stratified by 'df$yourFactor'. Is it possible to stratify by several variables at once (e.g. 'df$yourFactor' and 'df$yourFactor2')? The following code seems to work, but I don't know if it is correct:

inTrain = createDataPartition(df$yourFactor, df$yourFactor2, p = 2/3, list = FALSE)

Answer 1

This is fairly simple if you use the tidyverse.

For example:

library(dplyr)

df <- df %>%
  mutate(n = row_number()) %>% # create a row id if you don't have one
  select(n, everything()) # put 'n' at the front of the dataset
train <- df %>%
  group_by(var1, var2) %>% # any number of variables you wish to stratify by
  sample_frac(.7) %>% # '.7' is the fraction of each group to sample
  ungroup()
test <- anti_join(df, train, by = "n") # test set: the rows of df that are not in 'train'
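To check that the split above really is stratified, it can help to run it on toy data and compare stratum sizes. The following is a minimal sketch, not part of the original answer: the toy data frame, the column `value`, and the name `check` are made up for illustration, while `var1` and `var2` stand in for your own stratification variables.

```r
library(dplyr)

set.seed(42)

# Toy data (hypothetical): two categorical variables to stratify by
df <- data.frame(
  var1  = sample(c("a", "b", "c"), 600, replace = TRUE),
  var2  = sample(c("x", "y"), 600, replace = TRUE),
  value = rnorm(600)
) %>%
  mutate(n = row_number()) # row id used to build the test set

# Sample 70% of the rows within every var1/var2 combination
train <- df %>%
  group_by(var1, var2) %>%
  sample_frac(0.7) %>%
  ungroup()

# Everything that was not sampled goes to the test set
test <- anti_join(df, train, by = "n")

# Fraction of each stratum that ended up in the training set
check <- train %>%
  count(var1, var2, name = "n_train") %>%
  left_join(count(df, var1, var2, name = "n_total"), by = c("var1", "var2")) %>%
  mutate(share = n_train / n_total)
print(check)
```

Each `share` should be close to 0.7; small deviations come from rounding the group sizes to whole rows.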

Answer 2

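There is a better way to do this: build a single stratum indicator from the variables and sample within each stratum. First, some example data: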

set.seed(1)
n <- 1e4
d <- data.frame(yourFactor = sample(1:5,n,TRUE), 
                yourFactor2 = rbinom(n,1,.5),
                yourFactor3 = rbinom(n,1,.7))

stratum indicator

d$group <- interaction(d[, c('yourFactor', 'yourFactor2')])

sample selection

indices <- tapply(1:nrow(d), d$group, sample, 30 )

obtain subsample

subsampd <- d[unlist(indices, use.names = FALSE), ]

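This draws a random sample of size 30 from every combination of yourFactor and yourFactor2, i.e. a stratified sample with a fixed size per stratum. Each stratum must contain at least 30 rows, because sample() draws without replacement by default.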
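The same stratum indicator can also be combined with caret, which is what the question asked about. Note that the second positional argument of createDataPartition() is `times`, so the call in the question passes yourFactor2 as the number of partitions rather than as a second stratification variable. The sketch below is one possible approach, not something taken from the answers above: it passes the interaction factor as `y`. The names `dTrain`, `dTest` and `folds` are chosen here for illustration.

```r
library(caret)

set.seed(1)
n <- 1e4
d <- data.frame(yourFactor  = sample(1:5, n, TRUE),
                yourFactor2 = rbinom(n, 1, .5),
                yourFactor3 = rbinom(n, 1, .7))

# One factor level per combination of the two variables;
# drop = TRUE discards combinations that do not occur in the data
d$group <- interaction(d$yourFactor, d$yourFactor2, drop = TRUE)

# For a factor y, createDataPartition() samples within each level
inTrain <- createDataPartition(d$group, p = 2/3, list = FALSE)
dTrain <- d[inTrain, ]
dTest  <- d[-inTrain, ]

# Difference between stratum proportions in the training set and in the full data
print(round(prop.table(table(dTrain$group)) - prop.table(table(d$group)), 3))

# The same factor can be used for cross-validation folds
folds <- createFolds(d$group, k = 5)
```

If some combinations are very rare, the strata may be too small to split sensibly; in that case it may be worth merging levels before building the interaction.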

solution1
2 2019-03-23 14:20:32

solution2
0 ACCEPTED 2019-02-07 06:31:34
