
spark aggregateByKey - sum and running average in the same call

I am learning Spark, and do not have experience with Hadoop.

Problem

I am trying to calculate the sum and average in the same call to aggregateByKey.

Let me share what I have tried so far.

Set up the data

val categoryPrices = List((1, 20), (1, 25), (1, 10), (1, 45))
val categoryPricesRdd = sc.parallelize(categoryPrices)

Attempt to calculate the average in the same call to aggregateByKey. This does not work.

val zeroValue1 = (0, 0, 0.0) // (count, sum, average)
categoryPricesRdd.
    aggregateByKey(zeroValue1)(
        (tuple, prevPrice) => {
            val newCount = tuple._1 + 1
            val newSum = tuple._2 + prevPrice
            val newAverage = newSum/newCount
            (newCount, newSum, newAverage)
        },
        (tuple1, tuple2) => {
            val newCount1 = tuple1._1 + tuple2._1
            val newSum1 = tuple1._2 + tuple2._2
            // TRYING TO CALCULATE THE RUNNING AVERAGE HERE
            val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1))/(tuple1._1 + tuple2._1)
            (newCount1, newSum1, newAverage1)
        }
    ).
    collect.
    foreach(println)

Result: Prints a different average each time

  • First time: (1,(4,100,70.0))
  • Second time: (1,(4,100,52.0))

Just do the sum first, and then calculate the average in a separate operation. This works.

val zeroValue2 = (0, 0) // (count, sum)
categoryPricesRdd.
    aggregateByKey(zeroValue2)(
        (tuple, prevPrice) => {
            val newCount = tuple._1 + 1
            val newSum = tuple._2 + prevPrice
            (newCount, newSum)
        },
        (tuple1, tuple2) => {
            val newCount1 = tuple1._1 + tuple2._1
            val newSum1 = tuple1._2 + tuple2._2
            (newCount1, newSum1)
        }
    ).
    map(rec => {
        val category = rec._1
        val count = rec._2._1
        val sum = rec._2._2
        (category, count, sum, sum/count)
    }).
    collect.
    foreach(println)

Prints the same result every time: (1,4,100,25)

I think I understand the difference between seqOp and combOp. Given that an operation can split data across multiple partitions on different servers, my understanding is that seqOp operates on the data within a single partition, and combOp then combines the results received from different partitions. Please correct me if this is wrong.
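To check my mental model, here is how I picture it as a plain-Scala sketch (no Spark; the partition split, the object name, and the fold/reduce simulation are my own assumptions): seqOp folds within each partition, and combOp merges the per-partition results.

```scala
object SeqCombDemo {
  def main(args: Array[String]): Unit = {
    // Simulated partitions of the values for key 1 (assumed split).
    val partitions = List(List(20, 25, 10), List(45))

    val zero = (0, 0) // (count, sum)

    // seqOp runs independently within each partition...
    val perPartition = partitions.map(_.foldLeft(zero) {
      case ((count, sum), price) => (count + 1, sum + price)
    })

    // ...and combOp then merges the per-partition results.
    val (count, sum) = perPartition.reduce {
      case ((c1, s1), (c2, s2)) => (c1 + c2, s1 + s2)
    }

    println((count, sum, sum / count)) // (4,100,25)
  }
}
```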

However, there is something very basic that I am not understanding. It looks like we can't calculate both the sum and the average in the same call. If this is true, please help me understand why.

The computation related to your average aggregation in seqOp:

val newAverage = newSum/newCount

and in combOp:

val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1)) / (tuple1._1 + tuple2._1)

is incorrect.

Let's say the first three elements are in one partition and the last element in another. Your seqOp would generate the (count, sum, average) tuples as follows:

Partition #1: [20, 25, 10]
  --> (1, 20, 20/1)
  --> (2, 45, 45/2)
  --> (3, 55, 55/3)

Partition #2: [45]
  --> (1, 45, 45/1)

Next, the cross-partition combOp would combine the two tuples from the two partitions to give:

((55 * 3) + (45 * 1)) / 4
// Result: 52 (integer division of 210 / 4)

As you can see from the above steps, the average value could be different if the ordering of the RDD elements or the partitioning is different.
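You can verify the order dependence without Spark. The snippet below (plain Scala; the two partition layouts and the helper name are just assumed examples) applies the combOp formula from the question to two different partitionings of the same four prices and gets two different "averages":

```scala
object BuggyAverageDemo {
  // The formula from the question's combOp: the tuple is (count, sum, average),
  // but it multiplies the SUM (._2) by the count instead of the average.
  def buggyComb(t1: (Int, Int, Int), t2: (Int, Int, Int)): Int =
    ((t1._2 * t1._1) + (t2._2 * t2._1)) / (t1._1 + t2._1)

  def main(args: Array[String]): Unit = {
    // Layout A: [20, 25, 10] | [45] -> seqOp yields (3, 55, _) and (1, 45, _)
    println(buggyComb((3, 55, 0), (1, 45, 0))) // ((55*3)+(45*1))/4 = 52
    // Layout B: [20, 25] | [10, 45] -> seqOp yields (2, 45, _) and (2, 55, _)
    println(buggyComb((2, 45, 0), (2, 55, 0))) // ((45*2)+(55*2))/4 = 50
  }
}
```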

Your second approach works because the average is, by definition, the total sum divided by the total count, and hence is better calculated after the sum and count values have been computed.
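That said, you can carry the average in the same aggregateByKey call if both seqOp and combOp always recompute it from the running sum and count, never from a previously computed average. A sketch of the corrected functions (plain Scala so it runs without a cluster; the fold/reduce lines simulate what Spark would do when you pass these to aggregateByKey((0, 0, 0.0))(seqOp, combOp)):

```scala
object SumAndAverageDemo {
  type Acc = (Int, Int, Double) // (count, sum, average)

  // seqOp: derive the average from the new sum and count.
  def seqOp(acc: Acc, price: Int): Acc = {
    val count = acc._1 + 1
    val sum = acc._2 + price
    (count, sum, sum.toDouble / count) // .toDouble avoids integer division
  }

  // combOp: merge counts and sums, then derive the average.
  def combOp(a: Acc, b: Acc): Acc = {
    val count = a._1 + b._1
    val sum = a._2 + b._2
    (count, sum, sum.toDouble / count)
  }

  def main(args: Array[String]): Unit = {
    // Simulated partitions; any split of the values gives the same answer.
    val partitions = List(List(20, 25, 10), List(45))
    val result = partitions.map(_.foldLeft((0, 0, 0.0): Acc)(seqOp)).reduce(combOp)
    println(result) // (4,100,25.0)
  }
}
```

Because the third tuple element is now a pure function of the first two, the result no longer depends on element ordering or partitioning.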
