Spark aggregateByKey - sum and running average in the same call
I am learning Spark, and do not have experience with Hadoop.
Problem
I am trying to calculate the sum and average in the same call to aggregateByKey.
Let me share what I have tried so far.
Set up the data
val categoryPrices = List((1, 20), (1, 25), (1, 10), (1, 45))
val categoryPricesRdd = sc.parallelize(categoryPrices)
Attempt to calculate the average in the same call to aggregateByKey. This does not work.
val zeroValue1 = (0, 0, 0.0) // (count, sum, average)
categoryPricesRdd.
  aggregateByKey(zeroValue1)(
    (tuple, prevPrice) => {
      val newCount = tuple._1 + 1
      val newSum = tuple._2 + prevPrice
      val newAverage = newSum / newCount
      (newCount, newSum, newAverage)
    },
    (tuple1, tuple2) => {
      val newCount1 = tuple1._1 + tuple2._1
      val newSum1 = tuple1._2 + tuple2._2
      // TRYING TO CALCULATE THE RUNNING AVERAGE HERE
      val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1)) / (tuple1._1 + tuple2._1)
      (newCount1, newSum1, newAverage1)
    }
  ).
  collect.
  foreach(println)
Result: Prints a different average each time
Just do the sum first, and then calculate the average in a separate operation. This works.
val zeroValue2 = (0, 0) // (count, sum)
categoryPricesRdd.
  aggregateByKey(zeroValue2)(
    (tuple, prevPrice) => {
      val newCount = tuple._1 + 1
      val newSum = tuple._2 + prevPrice
      (newCount, newSum)
    },
    (tuple1, tuple2) => {
      val newCount1 = tuple1._1 + tuple2._1
      val newSum1 = tuple1._2 + tuple2._2
      (newCount1, newSum1)
    }
  ).
  map(rec => {
    val category = rec._1
    val count = rec._2._1
    val sum = rec._2._2
    (category, count, sum, sum / count)
  }).
  collect.
  foreach(println)
Prints the same result every time: (1,4,100,25)
I think I understand the difference between seqOp and combOp. Given that an operation can split data across multiple partitions on different servers, my understanding is that seqOp operates on the data within a single partition, and then combOp combines the results received from the different partitions. Please correct me if this is wrong.
However, there is something very basic that I am not understanding. It looks like we can't calculate both the sum and the average in the same call. If this is true, please help me understand why.
The computation related to your average aggregation in seqOp:

val newAverage = newSum / newCount

and in combOp:

val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1)) / (tuple1._1 + tuple2._1)

is incorrect.
Let's say the first three elements are in one partition and the last element is in another. Your seqOp would generate the (count, sum, average) tuples as follows:
Partition #1: [20, 25, 10]
--> (1, 20, 20/1)
--> (2, 45, 45/2)
--> (3, 55, 55/3)
Partition #2: [45]
--> (1, 45, 45/1)
Next, the cross-partition combOp would combine the two tuples from the two partitions to give:
((55 * 3) + (45 * 1)) / 4
// Result: 52
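The trace above can be reproduced without Spark. The sketch below simulates the question's seqOp/combOp logic in plain Scala, treating each "partition" as a list folded separately; the two partitionings are hypothetical, chosen to show the effect:

```scala
// Plain-Scala simulation of the question's seqOp/combOp (no Spark needed),
// run over two hypothetical partitionings of the same data.
type Acc = (Int, Int, Double) // (count, sum, average)

def seqOp(acc: Acc, price: Int): Acc = {
  val newCount = acc._1 + 1
  val newSum   = acc._2 + price
  (newCount, newSum, (newSum / newCount).toDouble) // integer division, as in the question
}

def combOp(x: Acc, y: Acc): Acc = {
  val newCount = x._1 + y._1
  val newSum   = x._2 + y._2
  // the questionable "running average" formula from the question
  val newAvg   = (((x._2 * x._1) + (y._2 * y._1)) / (x._1 + y._1)).toDouble
  (newCount, newSum, newAvg)
}

val zero: Acc = (0, 0, 0.0)

// Partitioning A: [20, 25, 10] and [45]
val resultA = combOp(List(20, 25, 10).foldLeft(zero)(seqOp),
                     List(45).foldLeft(zero)(seqOp))

// Partitioning B: [20, 25] and [10, 45]
val resultB = combOp(List(20, 25).foldLeft(zero)(seqOp),
                     List(10, 45).foldLeft(zero)(seqOp))

println(resultA) // (4,100,52.0): ((55*3) + (45*1)) / 4 = 52
println(resultB) // (4,100,50.0): ((45*2) + (55*2)) / 4 = 50
```

The count and the sum agree under both partitionings because those updates are associative, but the "average" comes out as 52 or 50 depending on how the data happened to be split; the true average is 25.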
As you can see from the above steps, the average value could be different if the ordering of the RDD elements or the partitioning is different.
Your 2nd approach works, as average is by definition the total sum over the total count, hence it is better calculated after first computing the sum and count values.