I have been trying to follow the Stack Overflow example on creating DataFrames for the Spark ML machine learning library in Scala:
How to create correct data frame for classification in Spark ML
However, I cannot get the matching udf to work.
The compiler error is: "kinds of the type arguments (Vector,Int,Int,String,String) do not conform to the expected kinds of the type parameters (type RT,type A1,type A2,type A3,type A4). Vector's type parameters do not match type RT's expected parameters: type Vector has one type parameter, but type RT has none"
I need to create a DataFrame to feed into the logistic regression library. The sample source data looks like this:
Source, Amount, Account, Fraud
CACC1, 9120.50, 999, 0
CACC2, 3897.25, 999, 0
AMXCC1, -523, 999, 0
MASCC2, -8723.15, 999, 0
I suppose my desired output is:
+-------------------+-----+
|           features|label|
+-------------------+-----+
|  [1.0,9120.50,999]|  0.0|
|  [1.0,3897.25,999]|  0.0|
|  [2.0,-523.00,999]|  0.0|
| [0.0,-8723.15,999]|  0.0|
+-------------------+-----+
So far I have:
val df = sqlContext.sql("select * from prediction_test")
val df_2 = df.select("source","amount","account")
val toVec3 = udf[Vector,String,Int,Int] { (a,b,c) =>
  val e3 = c match {
    case "MASCC2" => 0
    case "CACC1" => 1
    case "AMXCC1" => 2
  }
  Vectors.dense(e1, b, c)
}
val encodeLabel = udf[Double, Int](_ match { case "0" => 0.0 case "1" => 1.0 })
val df_3 = df_2.withColumn("features", toVec3(df_2("source"),df_2("amount"),df_2("account")).withColumn("label", encodeLabel(df("fraud"))).select("features","label")
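For comparison, here is a minimal sketch of how the UDF attempt above could be made to compile. It assumes the error comes from `Vector` resolving to Scala's built-in collection `Vector[A]` (which takes a type parameter) instead of `org.apache.spark.ml.linalg.Vector` (which takes none), and that `amount` can be cast to a double and `account` to an int. The `UdfFeatures` object, `encodeSource`, `toFeatures`, the `CACC2` mapping (taken from the desired output table) and the `-1.0` fallback are illustrative choices, not part of the original post.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, udf}

object UdfFeatures {
  // Map each source code to a numeric id.
  // CACC2 -> 1.0 follows the desired output table; -1.0 for unknown codes is an assumption.
  def encodeSource(source: String): Double = source match {
    case "MASCC2"          => 0.0
    case "CACC1" | "CACC2" => 1.0
    case "AMXCC1"          => 2.0
    case _                 => -1.0
  }

  // With the ml.linalg import above, Vector has no type parameter,
  // so it is accepted as the UDF return type RT.
  val toVec3 = udf[Vector, String, Double, Int] { (source, amount, account) =>
    Vectors.dense(encodeSource(source), amount, account.toDouble)
  }

  // Hypothetical helper: builds the features/label frame from the raw columns.
  // The casts are assumptions about the table's column types.
  def toFeatures(df: DataFrame): DataFrame =
    df.withColumn("features",
        toVec3(col("source"), col("amount").cast("double"), col("account").cast("int")))
      .withColumn("label", col("fraud").cast("double"))
      .select("features", "label")
}
```

Under these assumptions, `UdfFeatures.toFeatures(df)` would replace the `df_2`/`df_3` steps; the label is produced with a plain cast, so no second UDF is needed.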
Using Spark 2.3.1, I suggest the following code for a classification-ready Spark ML Pipeline. If you want to include a classifier in the Pipeline, just add it at the point marked in the code. ClassificationPipeline returns a PipelineModel. Once you transform your data with this model, you get classification-ready columns named features and label.
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoderEstimator, StandardScaler, StringIndexer, VectorAssembler}
import org.apache.spark.sql.DataFrame

// Handles categorical features: returns the indexing stage and the name of its output column
def stringIndexerPipeline(inputCol: String): (Pipeline, String) = {
  val indexer = new StringIndexer()
    .setHandleInvalid("skip")
    .setInputCol(inputCol)
    .setOutputCol(inputCol + "_indexed")
  val pipeline = new Pipeline().setStages(Array(indexer))
  (pipeline, inputCol + "_indexed")
}

// Classification Pipeline Function
def ClassificationPipeline(df: DataFrame): PipelineModel = {
  // Preprocessing categorical features.
  // The names are lowercase because capitalized names in a val pattern
  // are matched as existing constants instead of being bound as new variables.
  val (sourcePipeline, sourceIndexed) = stringIndexerPipeline("Source")
  // Use StringIndexer output as input for OneHotEncoderEstimator
  // (the class was renamed OneHotEncoder in Spark 3.0)
  val oneHotEncoder = new OneHotEncoderEstimator()
    //.setDropLast(true)
    //.setHandleInvalid("skip")
    .setInputCols(Array(sourceIndexed))
    .setOutputCols(Array(sourceIndexed + "Vec"))
  // Gather the features that will pass through the pipeline
  val inputCols = oneHotEncoder.getOutputCols ++ Array("Amount", "Account")
  // Put all inputs in a column as a vector
  val vectorAssembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("featureVector")
  // Scale vector column
  val standardScaler = new StandardScaler()
    .setInputCol("featureVector")
    .setOutputCol("features")
    .setWithStd(true)
    .setWithMean(false)
  // Create StringIndexer for label col
  val labelIndexer = new StringIndexer()
    .setHandleInvalid("skip")
    .setInputCol("Fraud")
    .setOutputCol("label")
  // create classification object in here
  // val classificationObject = new ....
  // Create a pipeline
  val pipeline = new Pipeline().setStages(
    Array[PipelineStage](sourcePipeline, oneHotEncoder, vectorAssembler, standardScaler, labelIndexer/*, classificationObject*/))
  pipeline.fit(df)
}

val pipelineModel = ClassificationPipeline(df)
val transformedDF = pipelineModel.transform(df)
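As an illustration of the "classification object" mentioned above, the sketch below builds a `LogisticRegression` that reads the features and label columns produced by the pipeline and chains it after the fitted preprocessing model. The `ClassifierSketch` object, both helper names and the `maxIter` value are assumptions made for this example, not part of the original answer.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel, PipelineStage}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame

object ClassifierSketch {
  // Hypothetical helper: a logistic regression wired to the pipeline's output columns.
  def buildClassifier(): LogisticRegression =
    new LogisticRegression()
      .setFeaturesCol("features")
      .setLabelCol("label")
      .setMaxIter(10) // assumed value; tune for real data

  // Hypothetical helper: runs the already fitted preprocessing model,
  // then fits the classifier on its output.
  def fitClassifier(preprocessing: PipelineModel, df: DataFrame): PipelineModel =
    new Pipeline()
      .setStages(Array[PipelineStage](preprocessing, buildClassifier()))
      .fit(df)
}
```

With the values defined above, `ClassifierSketch.fitClassifier(pipelineModel, df).transform(df)` should add a prediction column next to features and label. Alternatively, the classifier can be placed directly in the stage array at the commented spot.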