Match DataFrame Categorical Variables in a Vector in Spark Scala
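I have been trying to follow Stack Overflow examples on creating DataFrames for the machine learning (ML) library in Spark Scala.
However, I cannot get the matching UDF to work.
Syntax error: "kinds of the type arguments (Vector,Int,Int,String,String) do not conform to the expected kinds of the type parameters (type RT,type A1,type A2,type A3,type A4). Vector's type parameters do not match type RT's expected parameters: type Vector has one type parameter, but type RT has none"
I need to create a DataFrame to feed into the logistic regression library. Sample source data looks like: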
Source, Amount, Account, Fraud
CACC1, 9120.50, 999, 0
CACC2, 3897.25, 999, 0
AMXCC1, -523, 999, 0
MASCC2, -8723.15, 999, 0
The output I think I want is:
+------------------+-----+
|          features|label|
+------------------+-----+
| [1.0,9120.50,999]|  0.0|
| [1.0,3897.25,999]|  0.0|
| [2.0,-523.00,999]|  0.0|
|[0.0,-8723.15,999]|  0.0|
+------------------+-----+
So far I have:
val df = sqlContext.sql("select * from prediction_test")
val df_2 = df.select("source","amount","account")
val toVec3 = udf[Vector,String,Int,Int] { (a,b,c) =>
  val e3 = c match {
    case "MASCC2" => 0
    case "CACC1" => 1
    case "AMXCC1" => 2
  }
  Vectors.dense(e1, b, c)
}
val encodeLabel = udf[Double, Int](_ match { case "0" => 0.0 case "1" => 1.0 })
val df_3 = df_2.withColumn("features", toVec3(df_2("source"),df_2("amount"),df_2("account")).withColumn("label", encodeLabel(df("fraud"))).select("features","label")
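A note on the error quoted above: scalac prints "type Vector has one type parameter" when the name Vector resolves to Scala's own collection type Vector[A] instead of Spark's org.apache.spark.ml.linalg.Vector, which is most likely what happened here because the Spark type was not imported. The sketch below shows one way the UDF approach could be made to compile. It is not the asker's final code. The column types (source as String, amount as Double, account and fraud as Int), the fallback code of -1.0 for unlisted sources, and the names UdfFeatureSketch, sourceCode and build are all assumptions made for illustration. Compared with the snippet above, it matches on the source string rather than on the account column, passes the computed code into Vectors.dense, and converts the integer label directly.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors} // Spark's Vector, not scala.collection.immutable.Vector
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object UdfFeatureSketch {
  // Category codes copied from the question's match expression.
  // Assumption: any source not listed there maps to -1.0 instead of throwing a MatchError.
  def sourceCode(source: String): Double = source match {
    case "MASCC2" => 0.0
    case "CACC1"  => 1.0
    case "AMXCC1" => 2.0
    case _        => -1.0
  }

  def build(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Sample rows from the question; column types are assumed.
    val df = Seq(
      ("CACC1", 9120.50, 999, 0),
      ("CACC2", 3897.25, 999, 0),
      ("AMXCC1", -523.0, 999, 0),
      ("MASCC2", -8723.15, 999, 0)
    ).toDF("source", "amount", "account", "fraud")

    // Type arguments are the return type followed by one type per input column.
    val toVec3 = udf[Vector, String, Double, Int] { (source, amount, account) =>
      Vectors.dense(sourceCode(source), amount, account.toDouble)
    }
    // The label column is assumed to be an Int, so a numeric conversion is enough.
    val encodeLabel = udf[Double, Int](_.toDouble)

    df.withColumn("features", toVec3(col("source"), col("amount"), col("account")))
      .withColumn("label", encodeLabel(col("fraud")))
      .select("features", "label")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("udf-sketch").getOrCreate()
    build(spark).show(false)
    spark.stop()
  }
}
```

A hand-written match like this has to be kept in sync with the data by hand, which is one reason the answer below relies on StringIndexer instead.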
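Using Spark 2.3.1, I suggest the following code for a classification-ready Spark ML Pipeline. If you want to include a classification object in the Pipeline, just add it at the place I have marked. ClassificationPipeline returns a PipelineModel. After transforming with this model, you get classification-ready columns named features and label.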
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Handles categorical features: indexes a string column and returns the
// pipeline together with the name of the indexed output column
def stringIndexerPipeline(inputCol: String): (Pipeline, String) = {
  val indexer = new StringIndexer()
    .setHandleInvalid("skip")
    .setInputCol(inputCol)
    .setOutputCol(inputCol + "_indexed")
  val pipeline = new Pipeline().setStages(Array(indexer))
  (pipeline, inputCol + "_indexed")
}

// Classification Pipeline Function
def ClassificationPipeline(df: DataFrame): PipelineModel = {
  // Preprocessing categorical features
  // (lowercase names: capitalized identifiers in a val pattern are looked up
  // as existing constants instead of being bound as new variables)
  val (sourcePipeline, sourceIndexed) = stringIndexerPipeline("Source")

  // Use StringIndexer output as input for OneHotEncoderEstimator
  val oneHotEncoder = new OneHotEncoderEstimator()
    //.setDropLast(true)
    //.setHandleInvalid("skip")
    .setInputCols(Array(sourceIndexed))
    .setOutputCols(Array(sourceIndexed + "Vec"))

  // Gather the features that will be passed through the pipeline
  val inputCols = oneHotEncoder.getOutputCols ++ Array("Amount", "Account")

  // Put all inputs in a column as a vector
  val vectorAssembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("featureVector")

  // Scale vector column
  val standardScaler = new StandardScaler()
    .setInputCol("featureVector")
    .setOutputCol("features")
    .setWithStd(true)
    .setWithMean(false)

  // Create StringIndexer for label col
  val labelIndexer = new StringIndexer()
    .setHandleInvalid("skip")
    .setInputCol("Fraud")
    .setOutputCol("label")

  // create classification object in here
  // val classificationObject = new ....

  // Create a pipeline
  val pipeline = new Pipeline().setStages(
    Array(sourcePipeline, oneHotEncoder, vectorAssembler, standardScaler, labelIndexer /*, classificationObject*/))
  pipeline.fit(df)
}

val pipelineModel = ClassificationPipeline(df)
val transformedDF = pipelineModel.transform(df)
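The answer leaves the classification object as a placeholder. As a rough illustration of what could go in that slot, the sketch below appends a LogisticRegression stage to a small pipeline and reads back the predictions. It is deliberately reduced and is not the answer's pipeline: the one-hot encoder and scaler are left out to keep it short, the label is supplied directly as a Double column, and the object name LogisticRegressionStageSketch, the method name predict, and the iteration limit are assumptions. The single 1.0 label is invented so that both classes appear in the training data.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.sql.{DataFrame, SparkSession}

object LogisticRegressionStageSketch {
  def predict(spark: SparkSession): DataFrame = {
    import spark.implicits._

    // Rows shaped like the question's sample data.
    // Assumption: the 1.0 label on the third row is made up for illustration.
    val df = Seq(
      ("CACC1", 9120.50, 999.0, 0.0),
      ("CACC2", 3897.25, 999.0, 0.0),
      ("AMXCC1", -523.0, 999.0, 1.0),
      ("MASCC2", -8723.15, 999.0, 0.0)
    ).toDF("Source", "Amount", "Account", "label")

    // Index the categorical column, as in the answer's stringIndexerPipeline.
    val sourceIndexer = new StringIndexer()
      .setHandleInvalid("skip")
      .setInputCol("Source")
      .setOutputCol("Source_indexed")

    // Assemble the numeric inputs into the single vector column the classifier expects.
    val assembler = new VectorAssembler()
      .setInputCols(Array("Source_indexed", "Amount", "Account"))
      .setOutputCol("features")

    // The classification object, placed last in the stage array.
    val lr = new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setMaxIter(10)

    val model = new Pipeline().setStages(Array(sourceIndexer, assembler, lr)).fit(df)

    // transform adds rawPrediction, probability and prediction columns.
    model.transform(df).select("features", "label", "probability", "prediction")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("lr-stage-sketch").getOrCreate()
    predict(spark).show(false)
    spark.stop()
  }
}
```

In the answer's ClassificationPipeline, the equivalent change would be to define the LogisticRegression instance where the placeholder comment sits and add it as the last element of the stage array.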