
Match Dataframe Categorical Variables in vector Spark Scala

I have been trying to follow a Stack Overflow example on creating a DataFrame for the machine learning (ML) library in Spark Scala:

How to create correct data frame for classification in Spark ML

However, I cannot get the matching udf to work.

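The error is: "kinds of the type arguments (Vector,Int,Int,String,String) do not conform to the expected kinds of the type parameters (type RT,type A1,type A2,type A3,type A4). Vector's type parameters do not match type RT's expected parameters: type Vector has one type parameter, but type RT has none"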

I need to create a DataFrame to feed into the logistic regression library. A sample of the source data looks like this:

Source, Amount, Account, Fraud
CACC1, 9120.50, 999, 0
CACC2, 3897.25, 999, 0
AMXCC1, -523, 999, 0
MASCC2, -8723.15, 999, 0

I think the output I want is:

+-------------------+-----+
|           features|label|
+-------------------+-----+
|[1.0,9120.50,999]  |  0.0|
|[1.0,3897.25,999]  |  0.0|
|[2.0,-523.00,999]  |  0.0|
|[0.0,-8723.15,999] |  0.0|
+-------------------+-----+

So far I have:

val df = sqlContext.sql("select * from prediction_test")
val df_2 = df.select("source","amount","account")

val toVec3 = udf[Vector,String,Int,Int] { (a,b,c) => 
  val e3 = c match {
    case "MASCC2" => 0
    case "CACC1" => 1
    case "AMXCC1" => 2
  }
  Vectors.dense(e1, b, c) 
}

val encodeLabel = udf[Double, Int](_match{case "0" => 0.0 case "1" => 1.0})

val df_3 = df_2.withColumn("features", toVec3(df_2("source"),df_2("amount"),df_2("account")).withColumn("label", encodeLabel(df("fraud"))).select("features","label")
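The kind error quoted above is what the compiler reports when `Vector` resolves to Scala's own collection `Vector[A]`, which takes a type parameter, rather than Spark's `org.apache.spark.ml.linalg.Vector`. The following is a minimal sketch of a UDF that compiles, not a definitive fix. It assumes the column types are String, Double, Int and Int, builds a small local DataFrame from the sample rows in place of the `prediction_test` table, and introduces two hypothetical helpers, `sourceCode` and `toFeatures`. The fallback code `-1.0` for unlisted sources such as CACC2 is also an assumption.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Local session for the sketch; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
import spark.implicits._

// Assumed column types: source String, amount Double, account Int, fraud Int.
val df = Seq(
  ("CACC1", 9120.50, 999, 0),
  ("CACC2", 3897.25, 999, 0),
  ("AMXCC1", -523.0, 999, 0),
  ("MASCC2", -8723.15, 999, 0)
).toDF("source", "amount", "account", "fraud")

// Codes copied from the match in the question. Returning -1.0 for any other
// source is an assumption of this sketch, not something the question states.
def sourceCode(source: String): Double = source match {
  case "MASCC2" => 0.0
  case "CACC1"  => 1.0
  case "AMXCC1" => 2.0
  case _        => -1.0
}

// Vector here is Spark's ml.linalg.Vector because of the import above.
def toFeatures(source: String, amount: Double, account: Int): Vector =
  Vectors.dense(sourceCode(source), amount, account.toDouble)

val toVec3 = udf(toFeatures _)

val df_3 = df
  .withColumn("features", toVec3(col("source"), col("amount"), col("account")))
  .withColumn("label", col("fraud").cast("double")) // plain cast instead of a label UDF
  .select("features", "label")

df_3.show(false)
```

Letting `udf` infer its type parameters from the function avoids spelling them out by hand, which is where the original declaration and its lambda disagreed (the match was applied to the `Int` argument instead of the `String` one).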


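With Spark 2.3.1, I suggest the following code for a classification-ready Spark ML Pipeline. If you want to include a classification object in the Pipeline, just add it at the place I have marked. ClassificationPipeline returns a PipelineModel. After you transform your data with this model, you get classification-ready columns named features and label.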

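import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Handles a categorical feature: returns an indexing pipeline and the name of its output column
def stringIndexerPipeline(inputCol: String): (Pipeline, String) = {
  val indexer = new StringIndexer()
    .setHandleInvalid("skip")
    .setInputCol(inputCol)
    .setOutputCol(inputCol + "_indexed")
  val pipeline = new Pipeline().setStages(Array(indexer))
  (pipeline, inputCol + "_indexed")
}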

// Classification pipeline function
def ClassificationPipeline(df: DataFrame): PipelineModel = {

  // Preprocess the categorical feature. The names are lowercase because
  // capitalized names in a val pattern are read as existing constants.
  val (sourcePipeline, sourceIndexed) = stringIndexerPipeline("Source")

  // Use the StringIndexer output as input for OneHotEncoderEstimator
  val oneHotEncoder = new OneHotEncoderEstimator()
    //.setDropLast(true)
    //.setHandleInvalid("skip")
    .setInputCols(Array(sourceIndexed))
    .setOutputCols(Array(sourceIndexed + "Vec"))

  // Gather the features that will be passed through the pipeline
  val inputCols = oneHotEncoder.getOutputCols ++ Array("Amount", "Account")

  // Put all inputs into a single vector column
  val vectorAssembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("featureVector")

  // Scale the vector column
  val standardScaler = new StandardScaler()
    .setInputCol("featureVector")
    .setOutputCol("features")
    .setWithStd(true)
    .setWithMean(false)

  // Create a StringIndexer for the label column
  val labelIndexer = new StringIndexer()
    .setHandleInvalid("skip")
    .setInputCol("Fraud")
    .setOutputCol("label")

  // Create the classification object here
  // val classificationObject = new ....

  // Create the pipeline and fit it
  val pipeline = new Pipeline().setStages(
    Array(sourcePipeline, oneHotEncoder, vectorAssembler, standardScaler, labelIndexer/*, classificationObject*/))
  pipeline.fit(df)
}

val pipelineModel = ClassificationPipeline(df)

val transformedDF = pipelineModel.transform(df)
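To illustrate where a classifier could go, here is a trimmed, self-contained sketch that places a `LogisticRegression` as the last stage, in the slot marked `classificationObject` above. It is not the full pipeline: the one-hot encoding and scaling stages are omitted for brevity and would sit in the same stage array. The `sample` DataFrame, the two extra fraud rows, and `setMaxIter(10)` are assumptions made only so the example can run with both classes present.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell a `spark` value already exists.
val spark = SparkSession.builder().master("local[*]").appName("lr-sketch").getOrCreate()
import spark.implicits._

// The first four rows are the sample from the question. The last two are
// hypothetical fraud rows, added only so that both label classes appear.
val sample = Seq(
  ("CACC1", 9120.50, 999, 0),
  ("CACC2", 3897.25, 999, 0),
  ("AMXCC1", -523.0, 999, 0),
  ("MASCC2", -8723.15, 999, 0),
  ("CACC1", 15000.0, 111, 1),
  ("AMXCC1", -9000.0, 111, 1)
).toDF("Source", "Amount", "Account", "Fraud")

// Index the categorical column and the label, then assemble the feature vector.
val sourceIndexer = new StringIndexer()
  .setHandleInvalid("skip").setInputCol("Source").setOutputCol("Source_indexed")
val assembler = new VectorAssembler()
  .setInputCols(Array("Source_indexed", "Amount", "Account")).setOutputCol("features")
val labelIndexer = new StringIndexer()
  .setHandleInvalid("skip").setInputCol("Fraud").setOutputCol("label")

// The classifier reads the features and label columns produced by the earlier stages.
val lr = new LogisticRegression()
  .setFeaturesCol("features").setLabelCol("label").setMaxIter(10)

val model = new Pipeline()
  .setStages(Array(sourceIndexer, assembler, labelIndexer, lr))
  .fit(sample)

val predictions = model.transform(sample)
predictions.select("features", "label", "prediction").show(false)
```

With the classifier inside the pipeline, a single `fit` trains the preprocessing and the model together, and `transform` on new data applies the same indexing before predicting.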
