
How to evaluate the performance of the model (accuracy) in Spark Pipeline with Linear Regression

I am running a Spark Pipeline with Linear Regression. I was able to fit and run the model, and I am looking for the following:

  1. To find the model's efficiency and other metrics I need the model summary. I found some Python examples for reference, but my pipeline is in Scala, shown below (a sketch of reading the summary from the fitted pipeline follows this code block).
        // Imports, including the two the code below also needs:
        // RegressionMetrics (for RMSE/R2) and DataFrame (for getDataFrame further down).
        import org.apache.spark.ml.feature.{OneHotEncoderEstimator, VectorAssembler}
        import org.apache.spark.ml.regression.LinearRegression
        import org.apache.spark.ml.{Pipeline, PipelineModel}
        import org.apache.spark.mllib.evaluation.RegressionMetrics
        import org.apache.spark.sql.functions._
        import org.apache.spark.sql.types.DecimalType
        import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
        import spark.implicits._

        val splitDF: Array[Dataset[Row]] = inputDF.randomSplit(Array(0.5, 0.5))
        val trainingDF = splitDF(0)
        val testingDF = splitDF(1)


        val encoder = new OneHotEncoderEstimator()
          .setInputCols(Array("_LookUpID"))
          .setOutputCols(Array("_LookUpID_Encoded"))

        val requiredFeatures = Array("_LookUpID_Encoded","VALUE1")
        val assembler = new VectorAssembler()
          .setInputCols(requiredFeatures)
          .setOutputCol("features")


        val lr = new LinearRegression()
          .setMaxIter(10)
          .setRegParam(0.3)
          .setElasticNetParam(0.8)
          .setFeaturesCol("features")
          .setLabelCol("VALUE2")

        // Build the pipeline: encode, assemble features, then fit the regression
        val pipeline = new Pipeline()
          .setStages(Array(encoder, assembler, lr))

        // Fit the pipeline to the training data
        val lrModel = pipeline.fit(trainingDF)

        val predictions = lrModel.transform(testingDF)
        println("*** Predictions ***")
        predictions.printSchema()  

        predictions.select("VALUE_DATE", "_LookUpID", "_CD", "VALUE1", "VALUE2", "prediction").show(100)

        // mllib's RegressionMetrics takes an RDD of (prediction, observation) pairs, built here by row position
        val rm = new RegressionMetrics(predictions.rdd.map(x => (x(4).asInstanceOf[Double], x(5).asInstanceOf[Double])))
        println("sqrt(MSE): " + Math.sqrt(rm.meanSquaredError))
        println("R Squared: " + rm.r2)
        println("Explained Variance: " + rm.explainedVariance + "\n")

Ingestion with partitions

def getDataFrame(sql: String, lowerNumber: Int, upperNumber: Int): DataFrame = {
  val inputDF: DataFrame = spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//url")
    .option("user", "user")
    .option("password", "password")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("dbtable", s"($sql)")
    .option("partitionColumn", "_LookUpID")
    .option("numPartitions", "6")
    .option("lowerBound", lowerNumber)
    .option("upperBound", upperNumber)
    .load()
  inputDF
}
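A hypothetical call for illustration (the query, table name, and bounds below are placeholders, not values from the post); with the reader options above, Spark splits the read into 6 partitions over _LookUpID between the bounds:

// Placeholder query and bounds, purely for illustration
val sql = "SELECT _LookUpID, _CD, VALUE_DATE, VALUE1, VALUE2 FROM SOME_TABLE"
val inputDF = getDataFrame(sql, lowerNumber = 1, upperNumber = 600000)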
  2. The following pipeline runs out of memory (java.lang.OutOfMemoryError: Java heap space at ...) when I feed it a dataset with 1 million rows, even though the job is allocated 32 GB of memory; it works fine at 100K rows. I tried .cache() on the inputDF without much success. Is it because of one-hot encoding _LookUpID? What else can I do differently? (See the repartitioning sketch after this item.)

     Update: I increased the heap memory on the driver along with the number of partitions and was able to resolve it.
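For reference, the driver heap was raised at submit time (for example via spark-submit's --driver-memory flag) and the data was spread over more partitions before fitting. A rough sketch of the repartitioning side, with an illustrative partition count:

import org.apache.spark.sql.functions.col

// Repartition the loaded DataFrame by the ID column before splitting and fitting;
// 48 is an illustrative count, to be tuned to the cluster and data volume.
val repartitionedDF = inputDF.repartition(48, col("_LookUpID"))
val splitDF = repartitionedDF.randomSplit(Array(0.5, 0.5))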

Thanks

Updated the question with RegressionMetrics to fetch RMSE, R Squared, etc. as metrics.

Partitioned the dataset and increased the heap memory for the driver, which resolved the memory issues for now. Will keep monitoring.
