Running multiclass classification with a Spark ML pipeline

Question

I just started using Spark ML pipelines to implement a multiclass classifier with LogisticRegressionWithLBFGS (which accepts the number of classes as a parameter).

I followed this example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{Row, SQLContext}       

case class LabeledDocument(id: Long, text: String, label: Double)
case class Document(id: Long, text: String)       

val conf = new SparkConf().setAppName("SimpleTextClassificationPipeline")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._       

// Prepare training documents, which are labeled.
val training = sc.parallelize(Seq(
      LabeledDocument(0L, "a b c d e spark", 1.0),
      LabeledDocument(1L, "b d", 0.0),
      LabeledDocument(2L, "spark f g h", 1.0),
      LabeledDocument(3L, "hadoop mapreduce", 0.0)))        


// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))       


// Fit the pipeline to training documents.
val model = pipeline.fit(training.toDF)       

// Prepare test documents, which are unlabeled.
val test = sc.parallelize(Seq(
      Document(4L, "spark i j k"),
      Document(5L, "l m n"),
      Document(6L, "mapreduce spark"),
      Document(7L, "apache hadoop")))       

// Make predictions on test documents.
model.transform(test.toDF)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        // The s prefix is required for $id, $text, etc. to be interpolated.
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }       

sc.stop()

The problem is that the LogisticRegression class used by ML is fixed at 2 classes by default (line 176): override val numClasses: Int = 2

Any idea how to solve this problem?

Thanks

Answer 1

As Odomontois already mentioned, if you'd like to build basic NLP pipelines with Spark ML Pipelines, you have only 2 options:

Use One-vs-Rest and pass it the existing LogisticRegression, i.e. new OneVsRest().setClassifier(logisticRegression)
Use a bag-of-words representation (CountVectorizer in Spark terms) together with a NaiveBayes classifier, which supports multiclass classification
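
Both options can be sketched roughly as follows. This is only a minimal illustration, assuming Spark 2.x or later (SparkSession instead of SQLContext), a local master, and a made-up three-class toy dataset; the object and method names (MulticlassPipelines, oneVsRestPipeline, naiveBayesPipeline) are hypothetical and not part of the original question.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, NaiveBayes, OneVsRest}
import org.apache.spark.ml.feature.{CountVectorizer, HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, for illustration only.
object MulticlassPipelines {

  // Option 1: wrap the binary LogisticRegression in OneVsRest,
  // which trains one binary model per class.
  def oneVsRestPipeline(): Pipeline = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol("words")
      .setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val ovr = new OneVsRest().setClassifier(lr)
    new Pipeline().setStages(Array(tokenizer, hashingTF, ovr))
  }

  // Option 2: bag of words via CountVectorizer, then NaiveBayes,
  // which handles more than two classes natively.
  def naiveBayesPipeline(): Pipeline = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val counts = new CountVectorizer().setInputCol("words").setOutputCol("features")
    val nb = new NaiveBayes()
    new Pipeline().setStages(Array(tokenizer, counts, nb))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MulticlassPipelines")
      .master("local[*]") // assumption: local run
      .getOrCreate()
    import spark.implicits._

    // Assumed toy data with three classes; labels must be 0.0, 1.0, 2.0, ...
    val training = Seq(
      (0L, "a b c d e spark", 0.0),
      (1L, "spark f g h", 0.0),
      (2L, "hadoop mapreduce", 1.0),
      (3L, "hadoop yarn hdfs", 1.0),
      (4L, "flink streaming state", 2.0),
      (5L, "flink checkpoint", 2.0)
    ).toDF("id", "text", "label")

    val test = Seq(
      (6L, "spark i j k"),
      (7L, "apache hadoop"),
      (8L, "flink state")
    ).toDF("id", "text")

    // Fit each pipeline and print its predictions.
    for (pipeline <- Seq(oneVsRestPipeline(), naiveBayesPipeline())) {
      pipeline.fit(training)
        .transform(test)
        .select("id", "text", "prediction")
        .show(false)
    }

    spark.stop()
  }
}
```

Only the prediction column is selected here because a OneVsRest model does not produce the probability column that a plain LogisticRegression model does.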

Answer 2

But your sample data only has 2 classes. Why would it do otherwise in "auto" mode? You can force a multinomial classifier, though:

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression

val family: Param[String]
Param for the name of family which is a description of the label distribution to be used in the model. Supported options:

"auto": Automatically select the family based on the number of classes: If numClasses == 1 || numClasses == 2, set to "binomial". Else, set to "multinomial"
"binomial": Binary logistic regression with pivoting.
"multinomial": Multinomial logistic (softmax) regression without pivoting.

Default is "auto".
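
As a rough sketch of what forcing the family looks like in the pipeline from the question: this assumes a Spark version where the family param exists (it is documented in the 2.2.0 API linked above), a local master, and a made-up three-class toy dataset. The object name MultinomialFamilySketch and the method buildPipeline are hypothetical.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Hypothetical helper object, for illustration only.
object MultinomialFamilySketch {

  // Same three stages as in the question, with the family set explicitly.
  def buildPipeline(): Pipeline = {
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol("words")
      .setOutputCol("features")
    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
      .setFamily("multinomial") // force softmax regression instead of "auto"
    new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MultinomialFamilySketch")
      .master("local[*]") // assumption: local run
      .getOrCreate()
    import spark.implicits._

    // Assumed toy data with three classes (labels 0.0, 1.0, 2.0).
    val training = Seq(
      (0L, "a b c d e spark", 0.0),
      (1L, "spark f g h", 0.0),
      (2L, "hadoop mapreduce", 1.0),
      (3L, "hadoop yarn hdfs", 1.0),
      (4L, "flink streaming state", 2.0),
      (5L, "flink checkpoint", 2.0)
    ).toDF("id", "text", "label")

    val test = Seq(
      (6L, "spark i j k"),
      (7L, "apache hadoop"),
      (8L, "flink state")
    ).toDF("id", "text")

    val model = buildPipeline().fit(training)

    // The fitted LogisticRegressionModel is the last stage of the PipelineModel;
    // its numClasses is inferred from the labels seen during training.
    val lrModel = model.stages.last.asInstanceOf[LogisticRegressionModel]
    println(s"numClasses = ${lrModel.numClasses}")

    // The probability column holds one entry per class.
    model.transform(test)
      .select("id", "text", "probability", "prediction")
      .show(false)

    spark.stop()
  }
}
```

With the default "auto" setting the same pipeline should already pick the multinomial family once the training labels contain more than two classes, so the explicit setFamily call mostly matters when you want softmax regression even for binary labels.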

Answer 1: score 1, posted 2016-10-11 15:47:02
Answer 2: score 0, posted 2018-01-25 19:12:44
