
Weighting class in machine learning task

I'm trying out a machine learning task (binary classification) using caret and was wondering if there is a way to incorporate information about "uncertain" class, or to weight the classes differently.

As an illustration, I've cut and pasted some of the code from the caret homepage that works with the Sonar dataset (placeholder code; it could be anything):

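library(mlbench)
testdat <- get(data(Sonar))
set.seed(946)
# Give each row a random Source; A-C appear twice in the pool, so they are drawn about twice as often as D-F
testdat$Source <- as.factor(sample(c(LETTERS[1:6], LETTERS[1:3]), nrow(testdat), replace = TRUE))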

yielding:

summary(testdat$Source)  
 A  B  C  D  E  F   
49 51 44 17 28 19   

after which I would continue with a typical train, tune, and test routine once I decide on a model.

What I've added here is another factor column of a source, or where the corresponding "Class" came from. As an arbitrary example, say these were 6 different people who made their designation of "Class" using slightly different methods and I want to put greater importance on A's classification method than B's but less than C's and so forth.
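One way to make that ordering concrete is to map each Source level to a numeric importance and expand it to one value per row. This is only a minimal sketch: the numbers in `source_weight` are made up for illustration (chosen so that C > A > B), and `src` stands in for `testdat$Source`.

```r
# Stand-in for testdat$Source (208 rows, same sampling scheme as above)
set.seed(946)
src <- factor(sample(c(LETTERS[1:6], LETTERS[1:3]), 208, replace = TRUE))

# Hypothetical importance per source (assumed values, not from the data): C > A > B
source_weight <- c(A = 2, B = 1, C = 3, D = 1.5, E = 1, F = 0.5)

# Look up by level name; indexing with the factor itself would use its integer codes
case_w <- unname(source_weight[as.character(src)])

# Rescale to mean 1 so the relative importance is kept but the overall scale is neutral
case_w <- case_w / mean(case_w)

# Average weight per source after rescaling
tapply(case_w, src, mean)
```

The resulting `case_w` has one entry per row, which is the shape that per-observation weight arguments generally expect.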

The actual data look something like this, with imbalances both between the classes (true/false, M/R, or whatever the class is) and among the Sources. From the vignettes and examples I have found, I would address the class imbalance by using a metric like ROC during tuning, but I'm not sure how to incorporate the Source imbalance at all.

Two plans I have considered:

  • separating the original data by Source and cycling through the factor levels one at a time, using the current level to build a model and the rest of the data to test it

  • instead of classification, turning it into a hybrid classification/regression problem, where I model the ranks of the sources. If A is considered best, then an "A positive" would get a score of +6, an "A negative" a score of -6, and so on. I would then fit a regression on these values, ignoring the Class column.

Any thoughts? Every search I conduct on classes and weights seems to turn up the class imbalance issue, which assumes that the classification itself is perfect (or a standard on which to model). Is it inappropriate even to try to incorporate that information, and should I just include everything and ignore the source? A potential issue with the first plan is that the smaller sources account for only a few hundred instances each, versus over 10,000 for the larger sources, so I am also concerned that a model built on a smaller set wouldn't generalize as well as one based on more data. Any thoughts would be appreciated.

There is no difference between weighting "because of importance" and weighting "because of imbalance". They are exactly the same setting: both answer the question "how strongly should I penalize the model for misclassifying this sample (or a sample from this class)?". So you do not need any regression, and you should not use one: this is a perfectly well-stated classification problem and you are simply overthinking it. Just provide sample weights, that's all. Many models in caret accept this kind of setting, including glmnet, glm, and cforest. If you want to use an SVM you should change packages, since ksvm does not support per-sample weights: for example https://cran.r-project.org/web/packages/gmum.r/gmum.r.pdf (for sample or class weighting) or https://cran.r-project.org/web/packages/e1071/e1071.pdf (if class weighting is enough).
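As a rough sketch of what this could look like in caret, the snippet below passes a per-row weight vector through the `weights` argument of `train`, which only has an effect for models that accept case weights. The values in `source_weight` are assumptions made up for illustration, and glmnet with 5-fold cross-validation is just one possible choice.

```r
library(caret)
library(mlbench)

data(Sonar)
set.seed(946)
src <- factor(sample(c(LETTERS[1:6], LETTERS[1:3]), nrow(Sonar), replace = TRUE))

# Hypothetical importance per source (assumed values): C > A > B
source_weight <- c(A = 2, B = 1, C = 3, D = 1.5, E = 1, F = 0.5)
case_w <- unname(source_weight[as.character(src)])

# Tune on ROC, as one would for the class imbalance anyway
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Source is not used as a predictor; it enters only through the weights
fit <- train(x = Sonar[, 1:60], y = Sonar$Class,
             method = "glmnet",
             metric = "ROC",
             weights = case_w,
             trControl = ctrl)
fit
```

Class imbalance and source importance can be combined in the same vector by multiplying a per-class factor into `case_w` before fitting.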


 