Apache Spark MLlib LabeledPoint null label issue
I'm trying to run one of the MLlib algorithms, namely LogisticRegressionWithLBFGS, on my database.
This algorithm takes the training set as LabeledPoint. Since LabeledPoint requires a double label (LabeledPoint(double label, Vector features)) and my database contains some null values, how can I solve this problem?
Here you can see the piece of code related to this issue:
val labeled = table.map{ row =>
var s = row.toSeq.toArray
s = s.map(el => if (el != null) el.toString.toDouble)
LabeledPoint(row(0), Vectors.dense((s.take(0) ++ s.drop(1))))
}
And the error that I get:
error : type mismatch;
found : Any
required: Double
Can I run this algorithm without using LabeledPoint, or how else can I overcome this "null value" issue?
Some reasons why this code cannot work:
- Row.toSeq is of type () => Seq[Any] and so is s.
- Since you cover only the non-null case, el => if (el != null) el.toString.toDouble is of type T => AnyVal (where T is Any). If el is null it returns Unit (a corrected conversion is sketched right after this list).
- Even if it didn't, you assign the result to a var of type Seq[Any], so that is exactly what you get. One way or another it is not a valid input for Vectors.dense.
- Row.apply is of type Int => Any, so the output cannot be used as a label.
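Purely as an illustration (not part of the original answer), here is a minimal sketch of a per-element conversion that covers both branches and therefore yields Doubles, assuming for the example that nulls should simply become 0.0:
// Hypothetical sketch: map every element to a concrete Double,
// substituting 0.0 for nulls (an assumption made only for this example)
val s: Array[Double] = row.toSeq.toArray.map {
  case null => 0.0
  case el   => el.toString.toDouble
}
The suggestions below avoid this manual handling by fixing the nulls at the DataFrame level instead.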
Should work but have no effect:
- s.take(0)
May stop working in Spark 2.0:
- map over a DataFrame - not much we can do about it now, since the Vector class has no encoder available.
How you can approach this:
- either filter complete rows or fill missing values, for example using DataFrameNaFunctions:
// You definitely want something smarter than that
val fixed = df.na.fill(0.0)
// or
val filtered = df.na.drop
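Here fill(0.0) replaces every null in the numeric columns with 0.0, while drop discards any row containing a null; which one is appropriate depends on whether zeros are a sensible stand-in for the missing values in your data.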
- use VectorAssembler to build vectors:
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(df.columns.tail)
  .setOutputCol("features")

val assembled = assembler.transform(fixed)
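Note that df.columns.tail assumes the label is the first column and every remaining column is a feature, matching the layout used in the question's LabeledPoint(row(0), ...).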
- convert to LabeledPoint:
import org.apache.spark.mllib.linalg.Vector // in Spark 1.x VectorAssembler produces mllib Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

// Assuming the label column is called "label"
assembled.select($"label", $"features").rdd.map {
  case Row(label: Double, features: Vector) => LabeledPoint(label, features)
}
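From there the original goal, running LogisticRegressionWithLBFGS, is straightforward; a minimal sketch, assuming the mapped RDD above is stored in a value called labeled and the labels are binary:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Hypothetical continuation: `labeled` is the RDD[LabeledPoint] built above
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2) // assumption: binary labels
  .run(labeled)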