
Too many Accumulators in Spark Job

My Spark app has 40 accumulators:

object MySparkApp {
  def main(args: Array[String]): Unit = {
    // initialize SparkContext (sc)

    val acc1 = sc.accumulator(0)
    val acc2 = sc.accumulator(0)
    // ...
    val acc40 = sc.accumulator(0)

    val logRdd = sc.textFile("input/path").map(x => parser.parse(x))
    logRdd.foreach(x => incrementCounter(x, acc1, acc2, /* ... */ acc40))
  }
}

This code is very ugly. What would be a good way to wrap these accumulators in something like an object and make the code more readable?

One option is to implement an accumulator for a Map[String, Long] type. Then, for every occurrence of a bad value in the data, add an entry with the "bad" field's name as the key and 1 as the value.

Implementation of the accumulator param:

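import org.apache.spark.AccumulatorParam

class StringToLongAccumulatorParam extends AccumulatorParam[Map[String, Long]] {
  // Merge two partial results, summing the counts of keys present in both maps.
  override def addInPlace(r1: Map[String, Long], r2: Map[String, Long]): Map[String, Long] = {
    r1 ++ r2.map { case (k, v) => k -> (v + r1.getOrElse(k, 0L)) }
  }

  // The zero value is an empty map, regardless of the initial value passed in.
  override def zero(initialValue: Map[String, Long]): Map[String, Long] = Map[String, Long]()
}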
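The merge in addInPlace can be checked without a cluster. The following is a minimal sketch in plain Scala (no Spark needed); MergeDemo and the field names "user", "url" and "ts" are made up for illustration and are not part of the original answer.

```scala
object MergeDemo {
  // Same expression as addInPlace above: keys present in both maps get
  // their counts summed, keys present in only one map are kept as-is.
  def merge(r1: Map[String, Long], r2: Map[String, Long]): Map[String, Long] =
    r1 ++ r2.map { case (k, v) => k -> (v + r1.getOrElse(k, 0L)) }

  def main(args: Array[String]): Unit = {
    val fromTaskA = Map("user" -> 2L, "url" -> 1L) // hypothetical partial counts
    val fromTaskB = Map("url" -> 3L, "ts" -> 1L)

    val merged = merge(fromTaskA, fromTaskB)
    assert(merged == Map("user" -> 2L, "url" -> 4L, "ts" -> 1L))
    println(merged)
  }
}
```

The r2 entries are rewritten to already include r1's count before ++ is applied, so overwriting r1's entry for a shared key does not lose anything.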

To use it, create an implicit val holding an instance of this param, then create the accumulator and add to it:

implicit val accParam = new StringToLongAccumulatorParam()
val accumulator = sc.accumulator[Map[String, Long]](Map[String, Long]())
// map is lazy: the accumulator is only updated once an action runs on rdd2
val rdd2 = rdd.map(v => { accumulator += Map("FieldName" -> 1L); v })

Of course, change "FieldName" to whatever you need. For each record, you can create a map with as many entries as you'd like and add it to the accumulator using +=.
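Note that sc.accumulator and AccumulatorParam have been deprecated since Spark 2.0 in favor of AccumulatorV2 (and, as far as I know, were removed in Spark 3.0). On newer versions, the same idea could be sketched roughly as follows. FieldCountAccumulator and the "badFields" name are hypothetical, and this is an untuned sketch rather than a definitive implementation.

```scala
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Hypothetical accumulator: takes a field name as input, exposes per-field counts.
class FieldCountAccumulator extends AccumulatorV2[String, Map[String, Long]] {
  private val counts = mutable.Map.empty[String, Long]

  override def isZero: Boolean = counts.isEmpty

  override def copy(): FieldCountAccumulator = {
    val c = new FieldCountAccumulator
    c.counts ++= counts
    c
  }

  override def reset(): Unit = counts.clear()

  // Called on executors: bump the count for one field name.
  override def add(field: String): Unit =
    counts(field) = counts.getOrElse(field, 0L) + 1L

  // Called on the driver: fold another task's partial counts into this one.
  override def merge(other: AccumulatorV2[String, Map[String, Long]]): Unit =
    other.value.foreach { case (k, v) => counts(k) = counts.getOrElse(k, 0L) + v }

  override def value: Map[String, Long] = counts.toMap
}

// Usage, assuming an existing SparkContext `sc` and RDD `rdd`:
//   val acc = new FieldCountAccumulator
//   sc.register(acc, "badFields")
//   rdd.foreach(v => acc.add("FieldName"))
//   println(acc.value)
```

Either way, a single accumulator keyed by field name replaces the 40 separate ones, so incrementCounter only needs one argument besides the record.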

NOTE: I'm not sure this will perform well if you have a lot of these erroneous values, but if most of your records won't end up creating these maps, the overhead should be negligible. If most records DO have null/bad values, perhaps this should not be done via accumulators but via actual RDD operations (map to 1s where the value is bad, then reduce).
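The RDD-based alternative mentioned in the note could look roughly like this. It is only a sketch: LogRecord, its fields, and the "null or empty means bad" rule in badFields are assumptions made up for illustration, since the question does not show the parsed record type.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BadFieldCounts {
  // Hypothetical parsed record; the real type comes from parser.parse.
  final case class LogRecord(user: String, url: String)

  // Hypothetical validity rule: a field is "bad" when it is null or empty.
  def badFields(r: LogRecord): Seq[String] =
    Seq("user" -> r.user, "url" -> r.url).collect {
      case (name, v) if v == null || v.isEmpty => name
    }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BadFieldCounts").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val records = sc.parallelize(Seq(
        LogRecord("a", ""), LogRecord("", ""), LogRecord("b", "/x")))

      // Map each bad field to 1 and reduce by field name.
      val counts = records
        .flatMap(badFields)
        .map(_ -> 1L)
        .reduceByKey(_ + _)
        .collectAsMap()

      // For the sample data: user is bad once, url is bad twice.
      counts.foreach { case (field, n) => println(s"$field: $n") }
    } finally {
      sc.stop()
    }
  }
}
```

Unlike accumulator updates inside a transformation, these counts are an ordinary job result, so they are not inflated if a stage is recomputed.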

