
Spark Accumulator

I am new to accumulators in Spark. I have created an accumulator that gathers the sum and count of all columns of a DataFrame into a Map. It is not functioning as expected, so I have a few doubts.

When I run this class (pasted below) in local mode, I can see the accumulator being updated, but the final value is still empty. For debugging purposes, I added a print statement in add().

Q1) Why is the final accumulable value not being updated even though add() is being called?

For reference, I studied CollectionAccumulator, which makes use of a SynchronizedList from Java Collections.

Q2) Does an accumulator need a synchronized/concurrent collection in order to update?

Q3) Which collection is best suited for this purpose?

I have attached my execution flow along with a Spark UI snapshot for analysis.

Thanks.

EXECUTION:

INPUT DATAFRAME -

+-------+-------+
|Column1|Column2|
+-------+-------+
|1      |2      |
|3      |4      |
+-------+-------+

OUTPUT -

Add - Map(Column1 -> Map(sum -> 1, count -> 1), Column2 -> Map(sum -> 2, count -> 1))

Add - Map(Column1 -> Map(sum -> 4, count -> 2), Column2 -> Map(sum -> 6, count -> 2))

TestRowAccumulator(id: 1, name: Some(Test Accumulator for Sum&Count), value: Map())

SPARK UI SNAPSHOT -

CLASS :

class TestRowAccumulator extends AccumulatorV2[Row,Map[String,Map[String,Int]]]{

  private var colMetrics: Map[String, Map[String, Int]] = Map[String , Map[String , Int]]()


  override def isZero: Boolean = this.colMetrics.isEmpty

  override def copy(): AccumulatorV2[Row, Map[String,Map[String,Int]]] = {
    val racc = new TestRowAccumulator
    racc.colMetrics = colMetrics
    racc
  }

  override def reset(): Unit = {
    colMetrics = Map[String,Map[String,Int]]()
  }

  override def add(v: Row): Unit = {

    v.schema.foreach(field => {
      val name: String = field.name
      val value: Int = v.getAs[Int](name)
      if (!colMetrics.contains(name)) {
        colMetrics = colMetrics ++ Map(name -> Map("sum" -> value, "count" -> 1))
      } else {
        val metric = colMetrics(name)
        val sum = metric("sum") + value
        val count = metric("count") + 1
        colMetrics = colMetrics ++ Map(name -> Map("sum" -> sum, "count" -> count))
      }
    })
  }

  override def merge(other: AccumulatorV2[Row, Map[String,Map[String,Int]]]): Unit = {
    other match {
      case t:TestRowAccumulator => {
        // Bug: the merged result below is computed but never assigned back to colMetrics.
        colMetrics.map(col => {
          val map2: Map[String, Int] = t.colMetrics.getOrElse(col._1, Map())
          val map1: Map[String, Int] = col._2
          map1 ++ map2.map { case (k, v) => k -> (v + map1.getOrElse(k, 0)) }
        })
      }
      case _ => throw new UnsupportedOperationException(s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
    }
  }

  override def value: Map[String, Map[String, Int]] = {
    colMetrics
  }
}

After a bit of debugging, I found that the merge function is being called. It contained erroneous code (the merged result was computed but never assigned back to colMetrics), so the accumulable value remained Map().

EXECUTION FLOW OF THE ACCUMULATOR (LOCAL MODE): ADD → ADD → MERGE

Once I corrected the merge function, the accumulator worked as expected.
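For completeness, the fix boils down to combining the two metric maps entry-wise and, crucially, assigning the result back to colMetrics. Here is a minimal, Spark-free sketch of that merge logic (MergeSketch and mergeMetrics are illustrative names, not part of the original class):

```scala
object MergeSketch {
  type Metrics = Map[String, Map[String, Int]]

  // Combine two per-column metric maps, summing matching inner keys
  // ("sum" and "count"); columns present in only one side are kept as-is.
  def mergeMetrics(a: Metrics, b: Metrics): Metrics =
    (a.keySet ++ b.keySet).map { name =>
      val m1 = a.getOrElse(name, Map.empty[String, Int])
      val m2 = b.getOrElse(name, Map.empty[String, Int])
      // Entry-wise sum: keys only in m1 survive the ++, keys in m2
      // overwrite with the summed value.
      name -> (m1 ++ m2.map { case (k, v) => k -> (v + m1.getOrElse(k, 0)) })
    }.toMap
}
```

Inside the accumulator, merge would then be `colMetrics = mergeMetrics(colMetrics, t.colMetrics)` — the assignment is what the original code was missing, which is why the driver's zero-value accumulator stayed at Map().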
