Run multiclass classification using Spark ML pipeline

I just started using the Spark ML pipeline to implement a multiclass classifier with LogisticRegressionWithLBFGS (which accepts the number of classes as a parameter).

I followed this example:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.sql.{Row, SQLContext}       

case class LabeledDocument(id: Long, text: String, label: Double)
case class Document(id: Long, text: String)       

val conf = new SparkConf().setAppName("SimpleTextClassificationPipeline")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._       

// Prepare training documents, which are labeled.
val training = sc.parallelize(Seq(
      LabeledDocument(0L, "a b c d e spark", 1.0),
      LabeledDocument(1L, "b d", 0.0),
      LabeledDocument(2L, "spark f g h", 1.0),
      LabeledDocument(3L, "hadoop mapreduce", 0.0)))        


// Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
val tokenizer = new Tokenizer()
      .setInputCol("text")
      .setOutputCol("words")
val hashingTF = new HashingTF()
      .setNumFeatures(1000)
      .setInputCol(tokenizer.getOutputCol)
      .setOutputCol("features")
val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)
val pipeline = new Pipeline()
      .setStages(Array(tokenizer, hashingTF, lr))       


// Fit the pipeline to training documents.
val model = pipeline.fit(training.toDF)       

// Prepare test documents, which are unlabeled.
val test = sc.parallelize(Seq(
      Document(4L, "spark i j k"),
      Document(5L, "l m n"),
      Document(6L, "mapreduce spark"),
      Document(7L, "apache hadoop")))       

// Make predictions on test documents.
model.transform(test.toDF)
      .select("id", "text", "probability", "prediction")
      .collect()
      .foreach { case Row(id: Long, text: String, prob: Vector, prediction: Double) =>
        println(s"($id, $text) --> prob=$prob, prediction=$prediction")
      }       

sc.stop()

The problem is that the LogisticRegression class used by ML uses 2 classes by default (line 176): override val numClasses: Int = 2

Any idea how to solve this problem?

Thanks

As Odomontois already mentioned, if you'd like to build basic NLP pipelines with Spark ML Pipelines, you have only 2 options:

  • Use One-vs-Rest and pass it the existing LogisticRegression, i.e. new OneVsRest().setClassifier(logisticRegression)
  • Use bag of words (CountVectorizer in Spark terms) with a NaiveBayes classifier, which supports multiclass classification
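Both options can be sketched roughly as follows. This is a minimal illustration rather than a definitive implementation: it assumes Spark 2.x with the SparkSession API and a local master, and the three-class training data (including the "flink" documents and label 2.0) is made up for the example. It is written script-style, as for spark-shell.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, NaiveBayes, OneVsRest}
import org.apache.spark.ml.feature.{CountVectorizer, HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.x running locally; adjust the master for your cluster.
val spark = SparkSession.builder()
  .appName("MulticlassTextPipelines")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical three-class training data (labels 0.0, 1.0, 2.0).
val training = Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0),
  (4L, "flink streaming", 2.0),
  (5L, "flink k l", 2.0)
).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")

// Option 1: wrap the binary LogisticRegression in OneVsRest,
// which trains one binary classifier per class.
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
val ovr = new OneVsRest().setClassifier(lr)
val ovrModel = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, ovr))
  .fit(training)

// Option 2: bag of words via CountVectorizer, followed by NaiveBayes.
val countVectorizer = new CountVectorizer()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val nbModel = new Pipeline()
  .setStages(Array(tokenizer, countVectorizer, new NaiveBayes()))
  .fit(training)

// Unlabeled test documents (also made up).
val test = Seq((6L, "spark i j k"), (7L, "flink hadoop")).toDF("id", "text")
val ovrPredictions = ovrModel.transform(test).select("id", "text", "prediction")
val nbPredictions = nbModel.transform(test).select("id", "text", "prediction")
ovrPredictions.show(false)
nbPredictions.show(false)
```

Only the "prediction" column is selected here because OneVsRest does not necessarily expose a "probability" column the way a plain LogisticRegression does.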

But your test samples only have 2 classes. Why would it do otherwise in "auto" mode? You can force a multinomial classifier, though:

https://spark.apache.org/docs/2.2.0/api/scala/index.html#org.apache.spark.ml.classification.LogisticRegression

val family: Param[String]
Param for the name of family which is a description of the label distribution to be used in the model. Supported options:

"auto": Automatically select the family based on the number of classes: If numClasses == 1 || numClasses == 2, set to "binomial". Else, set to "multinomial"
"binomial": Binary logistic regression with pivoting.
"multinomial": Multinomial logistic (softmax) regression without pivoting.

Default is "auto".
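As a rough sketch of forcing the multinomial family inside a pipeline: the snippet below assumes Spark 2.1 or later (where the family param is available), a local SparkSession, and invented three-class training data. Treat it as an illustration under those assumptions, written script-style as for spark-shell.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

// Assumption: Spark 2.1+ running locally; adjust the master for your cluster.
val spark = SparkSession.builder()
  .appName("MultinomialLogisticRegression")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// Hypothetical three-class training data (labels 0.0, 1.0, 2.0).
val training = Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0),
  (4L, "flink streaming", 2.0),
  (5L, "flink k l", 2.0)
).toDF("id", "text", "label")

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Explicitly request softmax regression instead of relying on "auto".
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
  .setFamily("multinomial")

val model = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, lr))
  .fit(training)

// Inspect the fitted classifier: the last pipeline stage is the LR model.
val lrModel = model.stages.last.asInstanceOf[LogisticRegressionModel]
println(s"numClasses = ${lrModel.numClasses}")
println(s"coefficientMatrix: ${lrModel.coefficientMatrix.numRows} x ${lrModel.coefficientMatrix.numCols}")
```

With "auto", the same pipeline would also pick the multinomial family once the training labels contain more than two classes; setting the family explicitly mainly matters when you want softmax behavior even on two-class data.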
