
Spark MLlib: Compute stddev-like value for Random Forest Regression

I have some data on which I want to learn the 'normal' behavior.

Using a limited set of variables, I managed to do that with a simple mean.

df.groupBy([My_Variables]) 
  .agg(
       mean("value").alias("prediction"),
       stddev("value").alias("sigma")
  )

Note: "value" is a double field.
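
(For context, a minimal sketch of how these per-group statistics can be used to measure how far each value sits from its group's "normal" behaviour; the "normals" name and the join keys are assumptions, not part of the original setup:)

import org.apache.spark.sql.functions.{abs, col, mean, stddev}

// Same aggregation as above, kept in a named DataFrame so it can be joined back
val normals = df.groupBy([My_Variables])
  .agg(mean("value").alias("prediction"), stddev("value").alias("sigma"))

// Number of sigmas each value sits away from its group's mean
val deviations = df.join(normals, Seq([My_Variables]))
  .withColumn("n_sigma", abs(col("value") - col("prediction")) / col("sigma"))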

I have also done the same thing using a Random Forest algorithm, which allows me to use more variables.

val limit_training_set:Long = 1517439600

val trainingData = df.filter(col("datetime").cast("long")<limit_training_set)
val testData = df.filter(col("datetime").cast("long")>limit_training_set)


val assembler = new VectorAssembler()
      .setInputCols(Array(
        [My_Variables]
      ))
      .setOutputCol("features")

... // (define Indexers and Imputers)

val rf = new RandomForestRegressor()
  .setNumTrees(10) 
  .setMaxDepth(18) 
  .setLabelCol("value")
  .setFeaturesCol("features")

val pipeline = new Pipeline()
    .setStages(Array([Indexers and Imputers], assembler, rf))


val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(5,10))
  .addGrid(rf.maxDepth, Array(10,18)) 
  .build()

// Set up the evaluator and the train/validation split.
val re = new RegressionEvaluator()
  .setMetricName("mae")
  .setLabelCol("value")

val tv = new TrainValidationSplit()
  .setEstimator(pipeline)
  .setEvaluator(re)
  .setEstimatorParamMaps(paramGrid)
  // 80% of the data will be used for training and the remaining 20% for validation.
  .setTrainRatio(0.8)


val model = tv.fit(trainingData)
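
(For reference, a minimal sketch of how the fitted model can be applied to the held-out data, reusing the model, testData and re objects defined above:)

// Score the held-out data with the best model found by the train/validation split
val predictions = model.transform(testData)

// "prediction" now holds the Random Forest output next to the true "value"
predictions.select("datetime", "value", "prediction").show(5)

// Same MAE evaluator used for the grid search, applied to the test set
println(s"Test MAE = ${re.evaluate(predictions)}")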

This gives me pretty good predictions, but compared to the mean method I lose the standard deviation information, which I would like to have.

Is there a way to compute a stddev-like value with a Random Forest in addition to the prediction? Or is there another ML algorithm that would be a better fit for that?

You could add a UnaryTransformer that computes a standard score, (value - mean) / std, for each desired field. This will add a new column to your rows:

import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.param.{DoubleParam, ParamMap}
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.types.{DataType, DoubleType}

// Standardizes a Double column using a precomputed mean and standard deviation
class ScoreField(override val uid: String)
  extends UnaryTransformer[Double, Double, ScoreField]
    with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("Std"))

  final val mean: DoubleParam = new DoubleParam(this, "mean", "mean")

  final val std: DoubleParam = new DoubleParam(this, "std", "std")

  def setMean(m: Double): this.type = set(mean, m)

  def setStd(s: Double): this.type = set(std, s)

  // Standard score: (value - mean) / std
  override protected def createTransformFunc: Double => Double =
    v => (v - $(mean)) / $(std)

  override protected def outputDataType: DataType = DoubleType

  override def copy(extra: ParamMap): ScoreField = defaultCopy(extra)
}

object ScoreField extends DefaultParamsReadable[ScoreField] {
  def apply() : ScoreField = new ScoreField()
  override def load(path: String): ScoreField = super.load(path)
}

// Create one such stage for each field you want to score
val score = new ScoreField()
score.setInputCol("inputField")
score.setOutputCol("outputField")
// Set the mean and std for your field; these must be computed beforehand
score.setMean(4.0)
score.setStd(2.0)
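
A minimal sketch of how those two constants could be computed beforehand, assuming they are taken from the training data for the same (placeholder) "inputField" column:

import org.apache.spark.sql.functions.{mean, stddev}

// Compute the two constants once from the training data
val stats = trainingData
  .agg(mean("inputField").alias("m"), stddev("inputField").alias("s"))
  .first()

score.setMean(stats.getDouble(0))
score.setStd(stats.getDouble(1))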

Then you have to add it to your pipeline:

val pipeline = new Pipeline()
  .setStages(Array([Indexers and Imputers], score, assembler, rf))

Once you call transform on the fitted pipeline, you will get rows containing both your prediction and your scored fields.
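
A short usage sketch, assuming the trainingData/testData split from the question:

val fittedPipeline = pipeline.fit(trainingData)

// Each transformed row now carries the RF "prediction" plus the scored "outputField"
val scored = fittedPipeline.transform(testData)
scored.select("value", "prediction", "outputField").show(5)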

Hope this helps.
