Spark Accumulator
I am new to accumulators in Spark. I have created an accumulator which gathers the sum and count of all columns in a DataFrame into a Map. It is not functioning as expected, so I have a few doubts.
When I run this class (pasted below) in local mode, I can see the accumulator getting updated, but the final value is still empty. For debugging purposes, I added a print statement in add().
Q1) Why is the final accumulable not updated even though add() is being called?
For reference, I studied Spark's built-in CollectionAccumulator, which makes use of a synchronized list from Java Collections (sketched after the questions below).
Q2) Does the backing collection need to be synchronized/concurrent for an accumulator to update?
Q3) Which collection is best suited for this purpose?
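The pattern I saw in CollectionAccumulator looks roughly like this (a sketch from memory, not the actual Spark source; the class name SyncListAccumulator is my own):

import java.util.{ArrayList, Collections, List => JList}
import org.apache.spark.util.AccumulatorV2

class SyncListAccumulator[T] extends AccumulatorV2[T, JList[T]] {
  // Backing store wrapped the way CollectionAccumulator guards its list
  private val list: JList[T] = Collections.synchronizedList(new ArrayList[T]())

  override def isZero: Boolean = list.isEmpty
  override def copy(): SyncListAccumulator[T] = {
    val newAcc = new SyncListAccumulator[T]
    newAcc.list.addAll(list)   // copy current contents into the new instance
    newAcc
  }
  override def reset(): Unit = list.clear()
  override def add(v: T): Unit = list.add(v)
  override def merge(other: AccumulatorV2[T, JList[T]]): Unit =
    list.addAll(other.value)
  override def value: JList[T] = list
}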
I have attached my execution flow along with a Spark UI snapshot for analysis.
Thanks.
EXECUTION:
INPUT DATAFRAME -
+-------+-------+
|Column1|Column2|
+-------+-------+
|1 |2 |
|3 |4 |
+-------+-------+
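(To reproduce, the input can be built like this, assuming an existing SparkSession named spark:)

import spark.implicits._
val df = Seq((1, 2), (3, 4)).toDF("Column1", "Column2")
df.show()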
OUTPUT -
Add - Map(Column1 -> Map(sum -> 1, count -> 1), Column2 -> Map(sum -> 2, count -> 1))
Add - Map(Column1 -> Map(sum -> 4, count -> 2), Column2 -> Map(sum -> 6, count -> 2))
TestRowAccumulator(id: 1, name: Some(Test Accumulator for Sum&Count), value: Map())
SPARK UI SNAPSHOT -
CLASS:
import org.apache.spark.sql.Row
import org.apache.spark.util.AccumulatorV2

class TestRowAccumulator extends AccumulatorV2[Row, Map[String, Map[String, Int]]] {

  private var colMetrics: Map[String, Map[String, Int]] = Map[String, Map[String, Int]]()

  override def isZero: Boolean = this.colMetrics.isEmpty

  override def copy(): AccumulatorV2[Row, Map[String, Map[String, Int]]] = {
    val racc = new TestRowAccumulator
    racc.colMetrics = colMetrics   // safe to share: the Map is immutable
    racc
  }

  override def reset(): Unit = {
    colMetrics = Map[String, Map[String, Int]]()
  }

  // Accumulate a per-column running sum and count for each incoming Row
  override def add(v: Row): Unit = {
    v.schema.foreach(field => {
      val name: String = field.name
      val value: Int = v.getAs[Int](name)
      if (!colMetrics.contains(name)) {
        colMetrics = colMetrics ++ Map(name -> Map("sum" -> value, "count" -> 1))
      } else {
        val metric = colMetrics(name)
        val sum = metric("sum") + value
        val count = metric("count") + 1
        colMetrics = colMetrics ++ Map(name -> Map("sum" -> sum, "count" -> count))
      }
    })
  }

  override def merge(other: AccumulatorV2[Row, Map[String, Map[String, Int]]]): Unit = {
    other match {
      case t: TestRowAccumulator =>
        // BUG (see the edit at the end of this post): the merged maps are
        // computed here but never assigned back to colMetrics, so the
        // result of the merge is discarded.
        colMetrics.map(col => {
          val map2: Map[String, Int] = t.colMetrics.getOrElse(col._1, Map())
          val map1: Map[String, Int] = col._2
          map1 ++ map2.map { case (k, v) => k -> (v + map1.getOrElse(k, 0)) }
        })
      case _ =>
        throw new UnsupportedOperationException(
          s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
    }
  }

  override def value: Map[String, Map[String, Int]] = colMetrics
}
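This is roughly how the accumulator is driven (spark and df are placeholders for my session and the input DataFrame above):

val acc = new TestRowAccumulator
spark.sparkContext.register(acc, "Test Accumulator for Sum&Count")
df.foreach(row => acc.add(row))   // add() runs inside the executor tasks
println(acc.value)                // value is read on the driver, after merge()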
After a bit of debugging, I found that the merge function is being called. It had erroneous code, so the accumulable value was Map().
EXECUTION FLOW OF THE ACCUMULATOR (LOCAL MODE): ADD, ADD, MERGE.
Once I corrected the merge function, the accumulator worked as expected.
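For anyone hitting the same problem: the fix is to assign the merged result back to colMetrics instead of discarding it. A corrected merge along these lines (my own sketch of the fix, not necessarily the only way) worked:

override def merge(other: AccumulatorV2[Row, Map[String, Map[String, Int]]]): Unit = {
  other match {
    case t: TestRowAccumulator =>
      // Union the column sets so columns seen only by `other` are kept,
      // then sum both metrics and assign the result back this time.
      colMetrics = (colMetrics.keySet ++ t.colMetrics.keySet).map { name =>
        val m1 = colMetrics.getOrElse(name, Map[String, Int]())
        val m2 = t.colMetrics.getOrElse(name, Map[String, Int]())
        name -> (m1 ++ m2.map { case (k, v) => k -> (v + m1.getOrElse(k, 0)) })
      }.toMap
    case _ =>
      throw new UnsupportedOperationException(
        s"Cannot merge ${this.getClass.getName} with ${other.getClass.getName}")
  }
}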