
Run multiclass classification using Spark ML pipeline

I have just started using Spark ML pipelines to implement a multiclass classifier, coming from LogisticRegressionWithLBFGS (which accepts the number of classes as a parameter).
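For context, here is a minimal sketch of the RDD-based mllib API I mean; trainingRdd is a hypothetical RDD[LabeledPoint]:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// The old RDD-based API exposes the number of classes directly.
def trainMulticlass(trainingRdd: RDD[LabeledPoint]) =
  new LogisticRegressionWithLBFGS()
    .setNumClasses(3) // e.g. three label classes: 0.0, 1.0, 2.0
    .run(trainingRdd)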

I followed this example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{Row, SQLContext}       

case class LabeledDocument(id: Long, text: String, label: Double)
case class Document(id: Long, text: String)       

val conf = new SparkConf().setAppName("SimpleTextClassificationPipeline")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._       

// Prepare training documents, which are labeled.
val training = sc.parallelize(Seq(
      LabeledDocument(0L, "a b c d e spark", 1.0),
      LabeledDocument(1L, "b d", 0.0),
      LabeledDocument(2L, "spark f g h", 1.0),
      LabeledDocument(3L, "hadoop mapreduce", 0.0)))        


// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))       


// Fit the pipeline to training documents.
val model = pipeline.fit(training.toDF)       

// Prepare test documents, which are unlabeled.
val test = sc.parallelize(Seq(
      Document(4L, "spark i j k"),
      Document(5L, "l m n"),
      Document(6L, "mapreduce spark"),
      Document(7L, "apache hadoop")))       

// Make predictions on test documents.
model.transform(test.toDF)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println("($id, $text) --> prob=$prob, prediction=$prediction")
      }       

sc.stop()

The problem is that the LogisticRegression class used by ML is hard-coded to 2 classes by default (line 176): override val numClasses: Int = 2

Any idea how to solve this problem?

Thanks

As Odomontois already mentioned, if you want a basic NLP pipeline using Spark ML Pipelines, you have only two options:

  • One-vs-Rest, passing it an existing LogisticRegression, i.e. new OneVsRest().setClassifier(logisticRegression), as sketched below
  • Bag-of-words (CountVectorizer in Spark terms) together with NaiveBayes, which supports multiclass classification, also sketched below
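A rough sketch of both options, reusing the tokenizer and hashingTF stages from the question (the CountVectorizer column names are assumptions):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, NaiveBayes, OneVsRest}
import org.apache.spark.ml.feature.CountVectorizer

// Option 1: wrap the binary LogisticRegression in a One-vs-Rest meta-classifier.
val ovr = new OneVsRest()
  .setClassifier(new LogisticRegression().setMaxIter(10).setRegParam(0.01))
val ovrPipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, ovr))

// Option 2: bag-of-words features via CountVectorizer, then NaiveBayes,
// which handles multiclass labels out of the box.
val cv = new CountVectorizer()
  .setInputCol("words")     // output of the Tokenizer stage
  .setOutputCol("features")
val nbPipeline = new Pipeline()
  .setStages(Array(tokenizer, cv, new NaiveBayes()))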

But your test sample has only 2 classes, so why would it do otherwise in "auto" mode? You can, however, force the use of a multinomial classifier:

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression

val family: Param[String]
Param for the name of family which is a description of the label distribution to be used in the model. Supported options:

"auto": Automatically select the family based on the number of classes: If numClasses == 1 || numClasses == 2, set to "binomial". Else, set to "multinomial"
"binomial": Binary logistic regression with pivoting.
"multinomial": Multinomial logistic (softmax) regression without pivoting. Default is "auto".

