
Apache Spark MLlib LabeledPoint null label issue

I'm trying to run one of the MLlib algorithms, namely LogisticRegressionWithLBFGS, on my database.

This algorithm takes the training set as LabeledPoint. Since LabeledPoint requires a double label (LabeledPoint(double label, Vector features)) and my database contains some null values, how can I solve this problem?

Here you can see the piece of code related to this issue:

val labeled = table.map{ row => 
    var s = row.toSeq.toArray           
    s = s.map(el => if (el != null) el.toString.toDouble)
    LabeledPoint(row(0), Vectors.dense((s.take(0) ++ s.drop(1))))
    }

And the error that I get:

error   : type mismatch;
found   : Any
required: Double

Can I run this algorithm without using LabeledPoint, or how else can I overcome this "null value" issue?

Some reasons why this code cannot work:

  • Row.toSeq is of type () => Seq[Any] and so is s
  • since you cover only the not-null case, el => if (el != null) el.toString.toDouble is of type T => AnyVal (where T is Any). If el is null it returns Unit
  • even if it wasn't, you assign the result back to a var of type Seq[Any], so that is exactly what you get. One way or another, it is not a valid input for Vectors.dense
  • Row.apply is of type Int => Any, so the output cannot be used as a label
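
To make the second point concrete, here is a minimal Scala sketch (the value name el is only illustrative) of why an if without an else does not produce a Double:

// `el` stands in for a single cell taken from a Row; the name is made up for illustration.
val el: Any = null

// Without an else branch, the expression's type is the least upper bound of
// Double (the then-branch) and Unit (the implicit else branch), i.e. AnyVal, not Double.
val notADouble = if (el != null) el.toString.toDouble

// A total alternative that always yields a Double (0.0 is only a placeholder default).
val asDouble: Double = if (el != null) el.toString.toDouble else 0.0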

Should work but has no effect:

  • s.take(0) (it returns an empty array, so s.take(0) ++ s.drop(1) is just s.drop(1))

May stop working in Spark 2.0:

  • map over a DataFrame - not much we can do about it now, since the Vector class has no encoder available.

How you can approach this:

  • either keep only complete rows or fill in the missing values, for example using DataFrameNaFunctions:

      // You definitely want something smarter than that
      val fixed = df.na.fill(0.0)
      // or
      val filtered = df.na.drop
  • use VectorAssembler to build vectors:

      import org.apache.spark.ml.feature.VectorAssembler

      val assembler = new VectorAssembler()
        .setInputCols(df.columns.tail)
        .setOutputCol("features")

      val assembled = assembler.transform(fixed)
  • convert to LabeledPoint:

      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.sql.Row

      // Assuming the label column is called "label"
      // (the $-column syntax assumes sqlContext.implicits._ is in scope)
      val labeled = assembled.select($"label", $"features").rdd.map {
        case Row(label: Double, features: Vector) => LabeledPoint(label, features)
      }
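
With those pieces in place, the original goal of running LogisticRegressionWithLBFGS follows directly. A minimal sketch, assuming the labeled RDD built in the previous step and a label column that holds only 0.0 and 1.0:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Cache the training set; L-BFGS makes multiple passes over the data.
labeled.cache()

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)   // binary labels assumed here; increase for multiclass
  .run(labeled)

If the label column is not already encoded as 0.0/1.0, you would need to recode it first, for example with StringIndexer from org.apache.spark.ml.feature.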
