Apache Spark MLlib LabeledPoint null label issue
I'm trying to run one of the MLlib algorithms, namely LogisticRegressionWithLBFGS, on my database.
This algorithm takes the training set as LabeledPoint. Since LabeledPoint requires a double label (LabeledPoint(double label, Vector features)) and my database contains some null values, how can I solve this problem?
Here you can see the piece of code related to this issue:
val labeled = table.map{ row =>
var s = row.toSeq.toArray
s = s.map(el => if (el != null) el.toString.toDouble)
LabeledPoint(row(0), Vectors.dense((s.take(0) ++ s.drop(1))))
}
And the error that I get:
error : type mismatch;
found : Any
required: Double
Can I run this algorithm without using LabeledPoint, or how else can I overcome this "null value" issue?
Some reasons why this code cannot work:
- Row.toSeq is of type () => Seq[Any] and so is s.
- Since you cover only the non-null case, el => if (el != null) el.toString.toDouble is of type T => AnyVal (where T is Any). If el is null it returns Unit (a corrected conversion is sketched right after this list).
- Even if it didn't, you assign the result to a var of type Seq[Any], so that is exactly what you get. One way or another it is not a valid input for Vectors.dense.
- Row.apply is of type Int => Any, so the output cannot be used as a label.
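Purely as an illustration (not part of the original answer), here is a minimal sketch of a per-element conversion that covers both branches and therefore yields Doubles, assuming for the example that nulls should simply become 0.0:
// Hypothetical sketch: map every element to a concrete Double,
// substituting 0.0 for nulls (an assumption made only for this example)
val s: Array[Double] = row.toSeq.toArray.map {
  case null => 0.0
  case el   => el.toString.toDouble
}
The suggestions below avoid this manual handling by fixing the nulls at the DataFrame level instead.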
Should work but have no effect:
- s.take(0)
May stop working in Spark 2.0:
- map over a DataFrame - not much we can do about it now, since the Vector class has no encoder available.
How you can approach this:
- either filter complete rows or fill missing values, for example using DataFrameNaFunctions:
// You definitely want something smarter than that
val fixed = df.na.fill(0.0)
// or
val filtered = df.na.drop
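Here fill(0.0) replaces every null in the numeric columns with 0.0, while drop discards any row containing a null; which one is appropriate depends on whether zeros are a sensible stand-in for the missing values in your data.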
- use VectorAssembler to build vectors:
import org.apache.spark.ml.feature.VectorAssembler

val assembler = new VectorAssembler()
  .setInputCols(df.columns.tail)
  .setOutputCol("features")

val assembled = assembler.transform(fixed)
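Note that df.columns.tail assumes the label is the first column and every remaining column is a feature, matching the layout used in the question's LabeledPoint(row(0), ...).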
- convert to LabeledPoint:
import org.apache.spark.mllib.linalg.Vector // in Spark 1.x VectorAssembler produces mllib Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.Row

// Assuming the label column is called "label"
assembled.select($"label", $"features").rdd.map {
  case Row(label: Double, features: Vector) => LabeledPoint(label, features)
}
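From there the original goal, running LogisticRegressionWithLBFGS, is straightforward; a minimal sketch, assuming the mapped RDD above is stored in a value called labeled and the labels are binary:
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// Hypothetical continuation: `labeled` is the RDD[LabeledPoint] built above
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2) // assumption: binary labels
  .run(labeled)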