
Spark aggregateByKey - sum and running average in the same call

I am learning Spark and do not have any experience with Hadoop.

Problem

I am trying to calculate the sum and average in the same call to aggregateByKey.

Let me share what I have tried so far.

Setup the data

val categoryPrices = List((1, 20), (1, 25), (1, 10), (1, 45))
val categoryPricesRdd = sc.parallelize(categoryPrices)

Attempt to calculate the average in the same call to aggregateByKey. This does not work.

val zeroValue1 = (0, 0, 0.0) // (count, sum, average)
categoryPricesRdd.
    aggregateByKey(zeroValue1)(
        (tuple, prevPrice) => {
            val newCount = tuple._1 + 1
            val newSum = tuple._2 + prevPrice
            val newAverage = newSum/newCount
            (newCount, newSum, newAverage)
        },
        (tuple1, tuple2) => {
            val newCount1 = tuple1._1 + tuple2._1
            val newSum1 = tuple1._2 + tuple2._2
            // TRYING TO CALCULATE THE RUNNING AVERAGE HERE
            val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1))/(tuple1._1 + tuple2._1)
            (newCount1, newSum1, newAverage1)
        }
    ).
    collect.
    foreach(println)

Result: Prints a different average each time

  • First time: (1,(4,100,70.0))
  • Second time: (1,(4,100,52.0))

Just do the sum first, and then calculate the average in a separate operation. This works.

val zeroValue2 = (0, 0) // (count, sum)
categoryPricesRdd.
    aggregateByKey(zeroValue2)(
        (tuple, prevPrice) => {
            val newCount = tuple._1 + 1
            val newSum = tuple._2 + prevPrice
            (newCount, newSum)
        },
        (tuple1, tuple2) => {
            val newCount1 = tuple1._1 + tuple2._1
            val newSum1 = tuple1._2 + tuple2._2
            (newCount1, newSum1)
        }
    ).
    map(rec => {
        val category = rec._1
        val count = rec._2._1
        val sum = rec._2._2
        (category, count, sum, sum/count)
    }).
    collect.
    foreach(println)

Prints the same result every time: (1,4,100,25)
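(A side note, not from the original post: the last field prints as 25 rather than 25.0 because sum/count here is integer division on two Int values; writing sum.toDouble/count in the map step would give 25.0.)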

I think I understand the difference between seqOp and combOp. Given that the data can be split across multiple partitions on different servers, my understanding is that seqOp operates on the data within a single partition, and combOp then combines the results received from different partitions. Please correct me if this is wrong.
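(A small side note, not part of the original post: one way to actually see how Spark has laid the pairs out across partitions is glom(), which turns each partition into an array. The two-partition split below is forced purely for illustration.)

val partitioned = sc.parallelize(categoryPrices, 2) // force 2 partitions just to make the split visible
partitioned.glom().collect().foreach(part => println(part.mkString("[", ", ", "]")))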

However, there is something very basic that I am not understanding: it looks like we can't calculate both the sum and the average in the same call. If this is true, please help me understand why.

Answer

The computation related to your average aggregation in seqOp:

val newAverage = newSum/newCount

and in combOp:

val newAverage1 = ((tuple1._2 * tuple1._1) + (tuple2._2 * tuple2._1)) / (tuple1._1 + tuple2._1)

is incorrect.

Let's say the first three elements are in one partition and the last element in another. Your seqOp would generate the (count, sum, average) tuples as follows:

Partition #1: [20, 25, 10]
  --> (1, 20, 20/1)
  --> (2, 45, 45/2)
  --> (3, 55, 55/3)

Partition #2: [45]
  --> (1, 45, 45/1)

Next, the cross-partition combOp would combine the 2 tuples from the two partitions to give:

((55 * 3) + (45 * 1)) / 4
// Result: 210 / 4 = 52 (integer division), i.e. the 52.0 you observed

As you can see from the above steps, the average value could be different if the ordering of the RDD elements or the partitioning is different.
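For example (a hypothetical alternative split, not taken from the original post), if the elements were partitioned as [20, 25] and [10, 45] instead, the same steps would give:

Partition #1: [20, 25]  -->  (1, 20, 20/1)  -->  (2, 45, 45/2)
Partition #2: [10, 45]  -->  (1, 10, 10/1)  -->  (2, 55, 55/2)

((45 * 2) + (55 * 2)) / 4
// Result: 200 / 4 = 50

a different value again, from exactly the same data.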

Your second approach works because the average is, by definition, the total sum divided by the total count, so it is best calculated after first computing the sum and the count.
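For illustration only (this sketch is not part of the original answer): if you did want both values out of a single aggregateByKey call, the key is to combine only the counts and sums, and to re-derive the average from those combined totals at every step, never from two previously computed averages.

val zeroValue3 = (0, 0, 0.0) // (count, sum, average)
categoryPricesRdd.
    aggregateByKey(zeroValue3)(
        (acc, price) => {
            val newCount = acc._1 + 1
            val newSum = acc._2 + price
            (newCount, newSum, newSum.toDouble / newCount) // average re-derived from the running totals
        },
        (acc1, acc2) => {
            val newCount = acc1._1 + acc2._1
            val newSum = acc1._2 + acc2._2
            (newCount, newSum, newSum.toDouble / newCount) // never combine two previously computed averages
        }
    ).
    collect.
    foreach(println)

Because the counts and sums are combined associatively and the average is just a function of those totals, this prints (1,(4,100,25.0)) regardless of how the data is partitioned.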
