
General strategy for dealing with rare factor levels in classification?

Say I have a data set like this:

  breakfast    lunch     dinner    mood  
 ----------- ---------- --------- ------ 
  waffles     sandwich   chili     good  
  sausages    sandwich   pasta     good  
  yogurt      salad      stew      bad   
  gruel       salad      pizza     bad   
  gruel       pizza      pizza     good  
  sausages    pizza      pasta     good  
  waffles     salad      chili     good  
  gruel       soup       pizza     bad   
  waffles     soup       chili     good  
  sausages    salad      pasta     good  
  waffles     pizza      chili     good  
  yogurt      sandwich   stew      good  
  yogurt      pizza      stew      good  
  sausages    soup       pasta     good  
  gruel       sandwich   pizza     good  
  yogurt      soup       waffles   good  

I want to predict a person's mood based on what they ate that day. So I'll do a 70/30 train/test split and use a random forest, SVM or something like that to build a classifier.

In my experience, these classifiers raise an error if a predictor has a level in the test set that never appeared in the training set. That could happen with the last row, where dinner == "waffles", if that row ends up in the test set.
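A quick way to see the problem before fitting anything is to compare the levels on each side of a split. The sketch below uses base R only; `unseen_levels` is a hypothetical helper name, and the toy `train` and `test` frames are made up for illustration.

```r
# Hypothetical helper: list the levels of each column that occur in `test`
# but never in `train`.
unseen_levels <- function(train, test, cols = names(train)) {
  out <- lapply(cols, function(col) {
    setdiff(unique(as.character(test[[col]])),
            unique(as.character(train[[col]])))
  })
  names(out) <- cols
  # Keep only the columns that actually have unseen levels
  out[lengths(out) > 0]
}

# Toy data (assumed): "waffles" shows up as a dinner only in the test rows
train <- data.frame(dinner = c("chili", "pasta", "stew", "pizza"),
                    mood   = c("good", "good", "bad", "bad"))
test  <- data.frame(dinner = c("pizza", "waffles"),
                    mood   = c("good", "good"))

unseen_levels(train, test)
```

An empty list means every level in the test set was also seen in training.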

To avoid this, before doing the split I've usually dropped every row that contains a level whose frequency in its column is below 10%.
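For reference, that filtering step could look roughly like the sketch below. It is base R only; `drop_rare_rows` and its `min_frac` argument are hypothetical names, and the toy `meals` frame just reproduces the dinner counts from the table above.

```r
# Hypothetical helper: drop rows containing a level rarer than `min_frac`
# in any of the columns named in `cols`.
drop_rare_rows <- function(df, cols, min_frac = 0.10) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    values <- as.character(df[[col]])
    freq <- table(values) / nrow(df)       # relative frequency of each level
    rare <- names(freq)[freq < min_frac]   # levels below the threshold
    keep <- keep & !(values %in% rare)
  }
  df[keep, , drop = FALSE]
}

# Same dinner counts as the example data: 4 chili, 4 pasta, 3 stew, 4 pizza, 1 waffles
meals <- data.frame(
  dinner = c(rep("chili", 4), rep("pasta", 4), rep("stew", 3),
             rep("pizza", 4), "waffles"),
  stringsAsFactors = FALSE
)

filtered <- drop_rare_rows(meals, cols = "dinner")
table(filtered$dinner)
```

Here "waffles" accounts for 1 of 16 rows (6.25%), so that row is the only one removed.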

I suspect there may be a better way. I mainly code in R, but if you want to post an answer in Python, I'll probably be able to understand it.

Thanks!

Now that I know the lingo, I found this post with an R use case: stratified splitting the data

Applied to my example, stratifying on both dinner and resulting mood:

library(splitstackshape)  # provides stratified()

meals_mood_text <- "breakfast   lunch   dinner  mood
waffles sandwich    chili   good
sausages    sandwich    pasta   good
yogurt  soup    waffles good
yogurt  salad   stew    bad
gruel   salad   pizza   bad
gruel   pizza   pizza   good
sausages    pizza   pasta   good
waffles salad   chili   good
gruel   soup    pizza   bad
waffles soup    chili   good
sausages    salad   pasta   good
waffles pizza   chili   good
yogurt  sandwich    stew    good
yogurt  pizza   stew    good
sausages    soup    pasta   good
gruel   sandwich    pizza   good"

# Parse the whitespace-delimited text with base R; the `text` argument
# avoids opening and closing a connection by hand
meals_mood_frame <- read.table(text = meals_mood_text, header = TRUE)

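# Sample 70% of each dinner/mood group; bothSets = TRUE returns a list
# holding the sampled rows and the remaining rows
strat.res <- stratified(meals_mood_frame, c("dinner", "mood"), 0.7, bothSets = TRUE)

print(strat.res[[1]])  # sampled rows (training set)

print(strat.res[[2]])  # remaining rows (test set)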
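To make the idea behind stratifying concrete, here is a minimal base R sketch. It is not how splitstackshape implements `stratified()`, and its rounding may differ. `stratified_train_idx` is a hypothetical helper, and forcing at least one training row per group is an assumption of this sketch, made so that singleton groups such as the "waffles" dinner always land in the training set.

```r
# Hypothetical helper: sample a fraction of the row indices within each group.
stratified_train_idx <- function(df, group_cols, frac = 0.7) {
  groups <- interaction(df[group_cols], drop = TRUE)
  idx_by_group <- split(seq_len(nrow(df)), groups)
  picked <- lapply(idx_by_group, function(idx) {
    # Assumption of this sketch: every group sends at least one row to training
    n_train <- max(1, round(length(idx) * frac))
    # sample.int avoids the sample() pitfall with length-one vectors
    idx[sample.int(length(idx), n_train)]
  })
  sort(unlist(picked, use.names = FALSE))
}

# Dinner and mood columns with the same counts as the example data
meals <- data.frame(
  dinner = c(rep("chili", 4), rep("pasta", 4), rep("stew", 3),
             rep("pizza", 4), "waffles"),
  mood   = c(rep("good", 4), rep("good", 4), "bad", "good", "good",
             "bad", "good", "bad", "good", "good"),
  stringsAsFactors = FALSE
)

set.seed(1)
train_idx <- stratified_train_idx(meals, c("dinner", "mood"))
train <- meals[train_idx, ]
test  <- meals[-train_idx, ]

# Dinner levels present in test but missing from train
setdiff(test$dinner, train$dinner)
```

Because each dinner/mood group contributes at least one row to `train`, no dinner level can appear only in `test`.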
