
Spark MLlib: Compute stddev-like value for Random Forest Regression

I have some data on which I want to learn the 'normal' behavior.

Using a limited set of variables, I managed to do that with a simple mean.

df.groupBy([My_Variables]) 
  .agg(
       mean("value").alias("prediction"),
       stddev("value").alias("sigma")
  )

Note: "value" is a double field.
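
(For context, a minimal sketch of how these per-group statistics can be used to measure how far each value sits from its group's "normal" behaviour; the "normals" name and the join keys are assumptions, not part of the original setup:)

import org.apache.spark.sql.functions.{abs, col, mean, stddev}

// Same aggregation as above, kept in a named DataFrame so it can be joined back
val normals = df.groupBy([My_Variables])
  .agg(mean("value").alias("prediction"), stddev("value").alias("sigma"))

// Number of sigmas each value sits away from its group's mean
val deviations = df.join(normals, Seq([My_Variables]))
  .withColumn("n_sigma", abs(col("value") - col("prediction")) / col("sigma"))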

I have also done the same thing using a Random Forest algorithm, which allows me to use more variables.

val limit_training_set:Long = 1517439600

val trainingData = df.filter(col("datetime").cast("long")<limit_training_set)
val testData = df.filter(col("datetime").cast("long")>limit_training_set)


val assembler = new VectorAssembler()
      .setInputCols(Array(
        [My_Variables]
      ))
      .setOutputCol("features")

... // (define Indexers and Imputers)

val rf = new RandomForestRegressor()
  .setNumTrees(10) 
  .setMaxDepth(18) 
  .setLabelCol("value")
  .setFeaturesCol("features")

val pipeline = new Pipeline()
    .setStages(Array([Indexers and Imputers], assembler, rf))


val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(5,10))
  .addGrid(rf.maxDepth, Array(10,18)) 
  .build()

// Set up the evaluator and the train/validation split.
val re = new RegressionEvaluator()
  .setMetricName("mae")
  .setLabelCol("value")

val tv = new TrainValidationSplit()
  .setEstimator(pipeline)
  .setEvaluator(re)
  .setEstimatorParamMaps(paramGrid)
  // 80% of the data will be used for training and the remaining 20% for validation.
  .setTrainRatio(0.8)


val model = tv.fit(trainingData)
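
(For reference, a minimal sketch of how the fitted model can be applied to the held-out data, reusing the model, testData and re objects defined above:)

// Score the held-out data with the best model found by the train/validation split
val predictions = model.transform(testData)

// "prediction" now holds the Random Forest output next to the true "value"
predictions.select("datetime", "value", "prediction").show(5)

// Same MAE evaluator used for the grid search, applied to the test set
println(s"Test MAE = ${re.evaluate(predictions)}")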

This gives me pretty good predictions, but compared to the mean method I lose the standard deviation information, which I would like to have.

Is there a way to compute a stddev-like value with a Random Forest in addition to the prediction? Or is there another ML algorithm that would be a better fit for that?

You could add a UnaryTransformer that computes a standard score, (value - mean) / std, for each desired field. This will add a new column to your rows:

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.{DoubleParam, ParamMap}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.types.{DataType, DoubleType}

// Standardizes a Double column using a precomputed mean and standard deviation
class ScoreField(override val uid: String)
  extends UnaryTransformer[Double, Double, ScoreField]
    with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("Std"))

  final val mean: DoubleParam = new DoubleParam(this, "mean", "mean")

  final val std: DoubleParam = new DoubleParam(this, "std", "std")

  def setMean(m: Double): this.type = set(mean, m)

  def setStd(s: Double): this.type = set(std, s)

  // Standard score: (value - mean) / std
  override protected def createTransformFunc: Double => Double =
    v => (v - $(mean)) / $(std)

  override protected def outputDataType: DataType = DoubleType

  override def copy(extra: ParamMap): ScoreField = defaultCopy(extra)
}

object ScoreField extends DefaultParamsReadable[ScoreField] {
  def apply() : ScoreField = new ScoreField()
  override def load(path: String): ScoreField = super.load(path)
}

// Create one such stage for each field you want to score
val score = new ScoreField()
score.setInputCol("inputField")
score.setOutputCol("outputField")
// Set the mean and std for your field; these must be computed beforehand
score.setMean(4.0)
score.setStd(2.0)
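
A minimal sketch of how those two constants could be computed beforehand, assuming they are taken from the training data for the same (placeholder) "inputField" column:

import org.apache.spark.sql.functions.{mean, stddev}

// Compute the two constants once from the training data
val stats = trainingData
  .agg(mean("inputField").alias("m"), stddev("inputField").alias("s"))
  .first()

score.setMean(stats.getDouble(0))
score.setStd(stats.getDouble(1))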

Then you have to add it to your pipeline:

val pipeline = new Pipeline()
  .setStages(Array([Indexers and Imputers], score, assembler, rf))

Once you call transform on the fitted pipeline, you will get rows containing both your prediction and your scored fields.
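
A short usage sketch, assuming the trainingData/testData split from the question:

val fittedPipeline = pipeline.fit(trainingData)

// Each transformed row now carries the RF "prediction" plus the scored "outputField"
val scored = fittedPipeline.transform(testData)
scored.select("value", "prediction", "outputField").show(5)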

Hope this helps.
