Say I have a data set like this:
breakfast  lunch     dinner   mood
---------  --------  -------  ----
waffles    sandwich  chili    good
sausages   sandwich  pasta    good
yogurt     salad     stew     bad
gruel      salad     pizza    bad
gruel      pizza     pizza    good
sausages   pizza     pasta    good
waffles    salad     chili    good
gruel      soup      pizza    bad
waffles    soup      chili    good
sausages   salad     pasta    good
waffles    pizza     chili    good
yogurt     sandwich  stew     good
yogurt     pizza     stew     good
sausages   soup      pasta    good
gruel      sandwich  pizza    good
yogurt     soup      waffles  good
I want to predict a person's mood based on what they ate that day. So I'll do a 70/30 train/test split and use a random forest, SVM or something like that to build a classifier.
In my experience, classifiers complain if a predictor has a level in the test set that did not appear in the training set. That could happen with the last row, where dinner == "waffles".
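To make the problem concrete, here is a minimal base R sketch that lists levels present in a test set but missing from the training set. The helper name `unseen_levels` and the tiny train/test frames are made up for illustration (the rows are taken from the table above); it is not part of any package.

```r
# Hypothetical helper: for each column, return the levels that occur in
# `test` but never in `train`. Columns with no unseen levels are omitted.
unseen_levels <- function(train, test, cols = names(test)) {
  out <- lapply(cols, function(col) {
    setdiff(unique(as.character(test[[col]])),
            unique(as.character(train[[col]])))
  })
  names(out) <- cols
  out[lengths(out) > 0]
}

# Toy split (assumed for illustration): "waffles" lands only in the test set.
train <- data.frame(dinner = c("chili", "pasta", "stew", "pizza"),
                    mood   = c("good", "good", "bad", "bad"))
test  <- data.frame(dinner = c("pizza", "waffles"),
                    mood   = c("good", "good"))

unseen <- unseen_levels(train, test, cols = "dinner")
print(unseen)
```

Running a check like this right after splitting at least shows which levels would trip up the classifier before any model is fitted.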
To avoid this, I've usually dropped any rows with a level whose frequency is less than 10% in any column, before I do the split.
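The filtering step I describe could look roughly like this in base R. `drop_rare_levels` and its `min_share` argument are hypothetical names of my own, and 10% is just the threshold mentioned above. Only the dinner and mood columns from the table are reproduced, to keep it short.

```r
# Hypothetical helper: keep only rows whose value, in every listed column,
# belongs to a level that makes up at least `min_share` of that column.
drop_rare_levels <- function(df, cols, min_share = 0.10) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    values <- as.character(df[[col]])
    share  <- table(values) / length(values)   # relative frequency per level
    keep   <- keep & (as.numeric(share[values]) >= min_share)
  }
  df[keep, , drop = FALSE]
}

# dinner and mood columns, in the same row order as the table above
meals <- data.frame(
  dinner = c("chili", "pasta", "stew", "pizza", "pizza", "pasta", "chili",
             "pizza", "chili", "pasta", "chili", "stew", "stew", "pasta",
             "pizza", "waffles"),
  mood   = c("good", "good", "bad", "bad", "good", "good", "good", "bad",
             "good", "good", "good", "good", "good", "good", "good", "good")
)

filtered <- drop_rare_levels(meals, cols = "dinner")
print(table(filtered$dinner))
```

Here "waffles" appears in 1 of 16 rows (6.25%), so that row is dropped before the split. The obvious downside is that the data is simply thrown away.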
I suspect there may be a better way. I mainly code in R, but if you want to post an answer in Python, I'll probably be able to understand it.
thanks!
Now that I know the lingo, I found this post with an R use case: "stratified splitting the data".
Applied to my example, stratifying on both dinner and resulting mood:
library(splitstackshape)  # provides stratified()
meals_mood_text <- "breakfast lunch dinner mood
waffles sandwich chili good
sausages sandwich pasta good
yogurt soup waffles good
yogurt salad stew bad
gruel salad pizza bad
gruel pizza pizza good
sausages pizza pasta good
waffles salad chili good
gruel soup pizza bad
waffles soup chili good
sausages salad pasta good
waffles pizza chili good
yogurt sandwich stew good
yogurt pizza stew good
sausages soup pasta good
gruel sandwich pizza good"
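# Parse the inline text into a data frame (base R; no connection to clean up)
meals_mood_frame <- read.table(text = meals_mood_text, header = TRUE)
# Sample 70% within each dinner/mood group; bothSets = TRUE also returns the remainder
strat.res <- stratified(meals_mood_frame, c('dinner', 'mood'), 0.7, bothSets = TRUE)
print(strat.res[[1]])  # sampled 70%: training set
print(strat.res[[2]])  # remaining rows: test set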
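For anyone without splitstackshape, the same idea can be sketched in base R. This is not the package's implementation: `stratified_split` is a hypothetical helper, the seed is arbitrary, and sending at least one row of every group to the training set is my own assumption, made so that a singleton level such as dinner == "waffles" never ends up only in the test set.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical helper: sample `train_share` of the rows within each group
# defined by `group_cols`, always keeping at least one row per group in train.
stratified_split <- function(df, group_cols, train_share = 0.7) {
  groups <- interaction(df[group_cols], drop = TRUE)
  train_idx <- unlist(lapply(split(seq_len(nrow(df)), groups), function(idx) {
    n_train <- max(1, round(length(idx) * train_share))
    idx[sample.int(length(idx), n_train)]  # safe even when idx has length 1
  }), use.names = FALSE)
  list(train = df[train_idx, , drop = FALSE],
       test  = df[-train_idx, , drop = FALSE])
}

# dinner and mood columns, in the same row order as the table in the question
meals <- data.frame(
  dinner = c("chili", "pasta", "stew", "pizza", "pizza", "pasta", "chili",
             "pizza", "chili", "pasta", "chili", "stew", "stew", "pasta",
             "pizza", "waffles"),
  mood   = c("good", "good", "bad", "bad", "good", "good", "good", "bad",
             "good", "good", "good", "good", "good", "good", "good", "good")
)

split_res <- stratified_split(meals, c("dinner", "mood"))
print(split_res$train)
print(split_res$test)
```

Because the single "waffles" row is pinned to the training set, every dinner level in the test set has been seen during training. The price is that rare groups are slightly over-represented in training relative to a strict 70/30 split.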