匹配向量Spark Scala中的Dataframe分类变量

Question

我一直在尝试跟踪关于为spark scala中的机器学习ml库创建数据帧的堆栈溢出示例。

但是，我无法使匹配的udf工作。

语法：“种类的参数（Vector，Int，Int，String，String）不符合预期的类型参数类型（类型RT，类型A1，类型A2，类型A3，类型A4）。矢量的类型参数不匹配类型RT的预期参数：类型Vector有一个类型参数，但类型RT没有“

我需要创建一个数据框以输入逻辑回归库。 源样本数据示例包括：

Source, Amount, Account, Fraud
CACC1, 9120.50, 999, 0
CACC2, 3897.25, 999, 0
AMXCC1, -523, 999, 0
MASCC2, -8723.15, 999, 0

我想我想要的输出是：

+-------------------+-----+
|           features|label|
+-------------------+-----+
|[1.0,9120.50,999]  |  0.0|
|[1.0,3897.25,999]  |  0.0|
|[2.0,-523.00,999]  |  0.0|
|[0.0,-8723.15,999] |  0.0|
+-------------------+-----+

到目前为止，我有：

val df = sqlContext.sql("select * from prediction_test")
val df_2 = df.select("source","amount","account")

val toVec3 = udf[Vector,String,Int,Int] { (a,b,c) => 
  val e3 = c match {
    case "MASCC2" => 0
    case "CACC1" => 1
    case "AMXCC1" => 2
  }
  Vectors.dense(e1, b, c) 
}

val encodeLabel = udf[Double, Int](_match{case "0" => 0.0 case "1" => 1.0})

val df_3 = df_2.withColumn("features", toVec3(df_2("source"),df_2("amount"),df_2("account")).withColumn("label", encodeLabel(df("fraud"))).select("features","label")

如何在Spark ML中创建正确的分类数据框

Answer 1

通过使用Spark 2.3.1，我建议使用以下代码进行分类就绪Spark ML Pipeline。 如果要将分类对象包含在Pipeline中，则只需将其添加到我指出的位置即可。 ClassificationPipeline返回PipelineModel。 转换此模型后，您可以获得名为features和label的分类就绪列。

// Handles categorical features
 def stringIndexerPipeline(inputCol: String): (Pipeline, String) = {
      val indexer = new StringIndexer()
        .setHandleInvalid("skip")
        .setInputCol(inputCol)
        .setOutputCol(inputCol + "_indexed")
      val pipeline = new Pipeline().setStages(Array(indexer))
      (pipeline, inputCol + "_indexed")
    }

// Classification Pipeline Function
def ClassificationPipeline(df:DataFrame): PipelineModel = {

  // Preprocessing categorical features
  val (SourcePipeline, Source_indexed) = stringIndexerPipeline("Source")

  // Use StringIndexer output as input for OneHotEncoderEstimator
  val oneHotEncoder = new OneHotEncoderEstimator()
    //.setDropLast(true)
    //.setHandleInvalid("skip")
    .setInputCols(Array("Source_indexed"))
    .setOutputCols(Array("Source_indexedVec"))


  // Gather features that will be pass through pipeline
  val inputCols = oneHotEncoder.getOutputCols ++ Array("Amount","Account")

  // Put all inputs in a column as a vector
  val vectorAssembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("featureVector")

  // Scale vector column
  val standartScaler = new StandardScaler()
    .setInputCol("featureVector")
    .setOutputCol("features")
    .setWithStd(true)
    .setWithMean(false)

  // Create stringindexer for label col
  val labelIndexer = new StringIndexer().
    setHandleInvalid("skip").
    setInputCol("Fraud").
    setOutputCol("label")

  // create classification object in here 
  // val classificationObject = new ....


  // Create a pipeline
  val pipeline = new Pipeline().setStages(
    Array(SourcePipeline, oneHotEncoder, vectorAssembler, standartScaler, labelIndexer/*, classificationObject*/))
  pipeline.fit(df)



   }

val pipelineModel = ClassificationPipeline(df)

val transformedDF = pipelineModel.transform(df)

匹配向量Spark Scala中的Dataframe分类变量

问题描述

1 个解决方案

解决方案1
0 2018-08-20 04:39:54

匹配向量Spark Scala中的Dataframe分类变量

问题描述

1 个解决方案

解决方案1 0 2018-08-20 04:39:54

解决方案1
0 2018-08-20 04:39:54