Matching categorical variables to vectors in a Spark Scala DataFrame

Question

I have been trying to follow the Stack Overflow example about creating DataFrames for the Spark ML machine learning library in Scala:

How to create correct data frame for classification in Spark ML

However, I cannot get the matching UDF to work.

The compiler reports: "kinds of the type arguments (Vector,Int,Int,String,String) do not conform to the expected kinds of the type parameters (type RT,type A1,type A2,type A3,type A4). Vector's type parameters do not match type RT's expected parameters: type Vector has one type parameter, but type RT has none"

I need to create a DataFrame to feed into the logistic regression library. The source sample data looks like this:

Source, Amount, Account, Fraud
CACC1, 9120.50, 999, 0
CACC2, 3897.25, 999, 0
AMXCC1, -523, 999, 0
MASCC2, -8723.15, 999, 0

I suppose my desired output is:

+-------------------+-----+
|           features|label|
+-------------------+-----+
|[1.0,9120.50,999]  |  0.0|
|[1.0,3897.25,999]  |  0.0|
|[2.0,-523.00,999]  |  0.0|
|[0.0,-8723.15,999] |  0.0|
+-------------------+-----+

So far I have:

val df = sqlContext.sql("select * from prediction_test")
val df_2 = df.select("source","amount","account")

val toVec3 = udf[Vector,String,Int,Int] { (a,b,c) => 
  val e3 = c match {
    case "MASCC2" => 0
    case "CACC1" => 1
    case "AMXCC1" => 2
  }
  Vectors.dense(e1, b, c) 
}

val encodeLabel = udf[Double, Int](_match{case "0" => 0.0 case "1" => 1.0})

val df_3 = df_2.withColumn("features", toVec3(df_2("source"),df_2("amount"),df_2("account")).withColumn("label", encodeLabel(df("fraud"))).select("features","label")


Answer 1

Using Spark 2.3.1, I suggest the following code for a classification-ready Spark ML Pipeline. If you want to include a classifier in the Pipeline, just add it where I point out. ClassificationPipeline returns a PipelineModel. Once you transform your data with this model, you get classification-ready columns named features and label.

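import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Handles categorical features: wraps a StringIndexer in a one-stage Pipeline
// and returns it together with the name of the indexed output column
def stringIndexerPipeline(inputCol: String): (Pipeline, String) = {
  val indexer = new StringIndexer()
    .setHandleInvalid("skip")
    .setInputCol(inputCol)
    .setOutputCol(inputCol + "_indexed")
  val pipeline = new Pipeline().setStages(Array(indexer))
  (pipeline, inputCol + "_indexed")
}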

// Classification Pipeline Function
def ClassificationPipeline(df: DataFrame): PipelineModel = {

  // Preprocessing categorical features.
  // The names must start with a lowercase letter: in a val pattern,
  // capitalized identifiers are read as existing constants, not new variables.
  val (sourcePipeline, sourceIndexed) = stringIndexerPipeline("Source")

  // Use StringIndexer output as input for OneHotEncoderEstimator
  // (this class was renamed to OneHotEncoder in Spark 3.0)
  val oneHotEncoder = new OneHotEncoderEstimator()
    //.setDropLast(true)
    //.setHandleInvalid("skip")
    .setInputCols(Array(sourceIndexed))
    .setOutputCols(Array(sourceIndexed + "Vec"))

  // Gather the features that will be passed through the pipeline
  // (Amount and Account must be numeric columns)
  val inputCols = oneHotEncoder.getOutputCols ++ Array("Amount", "Account")

  // Put all inputs in a single vector column
  val vectorAssembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("featureVector")

  // Scale the vector column to unit standard deviation (no centering)
  val standardScaler = new StandardScaler()
    .setInputCol("featureVector")
    .setOutputCol("features")
    .setWithStd(true)
    .setWithMean(false)

  // Create a StringIndexer for the label column
  val labelIndexer = new StringIndexer()
    .setHandleInvalid("skip")
    .setInputCol("Fraud")
    .setOutputCol("label")

  // Create the classification object here
  // val classificationObject = new ....

  // Create a pipeline and fit it to the data
  val pipeline = new Pipeline().setStages(
    Array(sourcePipeline, oneHotEncoder, vectorAssembler, standardScaler, labelIndexer /*, classificationObject */))
  pipeline.fit(df)
}
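
Since the question mentions logistic regression, the commented-out classifier slot in the function above could be filled roughly as follows. This is only a sketch: the hyperparameter values are placeholders chosen for illustration, not recommendations, and the column names assume the features and label columns produced by the earlier stages.

```scala
import org.apache.spark.ml.classification.LogisticRegression

// Hypothetical settings; maxIter and regParam are placeholder values
// and should be tuned on real data.
val classificationObject = new LogisticRegression()
  .setFeaturesCol("features") // output column of the StandardScaler stage
  .setLabelCol("label")       // output column of the label StringIndexer
  .setMaxIter(100)
  .setRegParam(0.01)
```

Defined inside the function before the stages array, this object can be appended after labelIndexer. The fitted PipelineModel then also adds rawPrediction, probability and prediction columns when it transforms a DataFrame.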

val pipelineModel = ClassificationPipeline(df)

val transformedDF = pipelineModel.transform(df)
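
As for the compile error in the question: the message "type Vector has one type parameter" suggests that Vector resolved to Scala's built-in collection scala.collection.immutable.Vector[A] rather than Spark's org.apache.spark.ml.linalg.Vector, which usually means the Spark import is missing. If you prefer to keep the UDF approach, a minimal sketch could look like the one below. It assumes amount can be read as a Double and account as an Int; the helper name encodeSource and the fallback index -1.0 for unlisted sources are my own additions and do not come from the question.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.functions.udf

// Plain function so the mapping can be checked without a SparkSession.
// Assumption: sources not listed here (e.g. "CACC2") map to -1.0.
def encodeSource(source: String): Double = source match {
  case "MASCC2" => 0.0
  case "CACC1"  => 1.0
  case "AMXCC1" => 2.0
  case _        => -1.0
}

// Vector now refers to Spark's ml.linalg.Vector, which has no type parameter
val toVec3 = udf[Vector, String, Double, Int] { (source, amount, account) =>
  Vectors.dense(encodeSource(source), amount, account.toDouble)
}

// Assumes fraud is stored as an integer 0/1 column
val encodeLabel = udf[Double, Int](_.toDouble)

// Usage, assuming `df` has columns source, amount, account and fraud:
// val df3 = df
//   .withColumn("features", toVec3(df("source"), df("amount"), df("account")))
//   .withColumn("label", encodeLabel(df("fraud")))
//   .select("features", "label")
```

Compared with this hand-written mapping, the StringIndexer in the pipeline above derives the indices from the data, so new source values do not require code changes.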

Answer 1, posted 2018-08-20 04:39:54
