
Run multiclass classification using Spark ML pipeline

I just started using the Spark ML Pipeline API to implement a multiclass classifier, coming from LogisticRegressionWithLBFGS (which accepts the number of classes as a parameter).

I followed this example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{Row, SQLContext}       

case class LabeledDocument(id: Long, text: String, label: Double)
case class Document(id: Long, text: String)       

val conf = new SparkConf().setAppName("SimpleTextClassificationPipeline")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._       

// Prepare training documents, which are labeled.
val training = sc.parallelize(Seq(
      LabeledDocument(0L, "a b c d e spark", 1.0),
      LabeledDocument(1L, "b d", 0.0),
      LabeledDocument(2L, "spark f g h", 1.0),
      LabeledDocument(3L, "hadoop mapreduce", 0.0)))        


// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))       


// Fit the pipeline to training documents.
val model = pipeline.fit(training.toDF)       

// Prepare test documents, which are unlabeled.
val test = sc.parallelize(Seq(
      Document(4L, "spark i j k"),
      Document(5L, "l m n"),
      Document(6L, "mapreduce spark"),
      Document(7L, "apache hadoop")))       

// Make predictions on test documents.
model.transform(test.toDF)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }

sc.stop()

The problem is that the LogisticRegression class used by Spark ML defaults to 2 classes (line 176): override val numClasses: Int = 2

Any idea how to solve this problem?

Thanks

As Odomontois already mentioned, if you'd like to use basic NLP pipelines with Spark ML Pipelines, you have only two options:

  • One-vs-Rest, passing your existing LogisticRegression as the base classifier, i.e. new OneVsRest().setClassifier(logisticRegression)
  • A bag-of-words model (CountVectorizer in Spark terms) with the NaiveBayes classifier, which supports multiclass classification
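A minimal sketch of the One-vs-Rest option, reusing the tokenizer, hashingTF, and training values from the question (the lr estimator here mirrors the one already defined there):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

// Base binary classifier; OneVsRest trains one copy of it per class.
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

// Wrap it so the pipeline can handle an arbitrary number of label values.
val ovr = new OneVsRest()
  .setClassifier(lr)

// Drop it into the pipeline in place of the plain LogisticRegression stage.
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, ovr))
val model = pipeline.fit(training.toDF)
```

Note that OneVsRest's output omits the probability column, so the prediction loop from the question would need to select only "id", "text", and "prediction".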

But your training samples only have 2 classes, so why would it do otherwise in "auto" mode? You can force a multinomial classifier, though:

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression

val family: Param[String]

Param for the name of family, which is a description of the label distribution to be used in the model. Supported options:

  • "auto": automatically select the family based on the number of classes: if numClasses == 1 || numClasses == 2, set to "binomial"; otherwise, set to "multinomial"
  • "binomial": binary logistic regression with pivoting
  • "multinomial": multinomial logistic (softmax) regression without pivoting

Default is "auto".
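Forcing the multinomial family (available since Spark 2.1) can be sketched like this, building on the lr stage from the question:

```scala
import org.apache.spark.ml.classification.LogisticRegression

// Force softmax regression even when only two label values are present.
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
  .setFamily("multinomial")  // supported values: "auto", "binomial", "multinomial"
```

With "multinomial" set, the model's coefficients become a matrix with one row per class rather than a single coefficient vector.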
