I am trying to understand how aggregateByKey works in Spark.
The example below converts
("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5))
to
(Abby,Set(5, 4))
(David,Set(5, 6))
using the code below:
import scala.collection.mutable.HashSet

val babyNamesCSV = spark.sparkContext.parallelize(List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5)))

babyNamesCSV.aggregateByKey(new HashSet[Int])(
  // seqOp: fold one value into the per-partition accumulator
  (acc, v) => {
    println("start")
    println(acc)
    println(v)
    println("end")
    acc += v
  },
  // combOp: merge the accumulators of two partitions
  (acc1, acc2) => {
    println("start2")
    println(acc2)
    println(acc1)
    println("end2")
    acc1 ++ acc2
  }).map(line => {
    println(line)
    line
  }).take(100)
I observed that the printlns in the combiner (combOp) never showed up in the sbt terminal, even though the ones in the seqOp did. Is there a reason why?
Assuming that you are running in local mode (not on a cluster/YARN etc.), the only explanation I can think of is that babyNamesCSV has only one partition. This can happen if you have only one core, or if you set spark.master=local[1]. In that case the combiner is never called, because there are no partitions to merge. Try setting the number of partitions explicitly:
val babyNamesCSV = spark.sparkContext.parallelize(List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5)), numSlices = 2)
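You can check this directly. The sketch below (assuming spark is your existing SparkSession, e.g. in spark-shell) prints the partition count and the contents of each partition, so you can see when records of the same key land in different partitions and the combiner actually has something to merge:

// Minimal sketch: inspect partitioning (assumes `spark` is an existing SparkSession)
val rdd = spark.sparkContext.parallelize(
  List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5)), numSlices = 2)

println(rdd.getNumPartitions)                                 // should print 2
rdd.glom().collect().foreach(p => println(p.mkString(", ")))  // contents of each partition

// Once the same key appears in more than one partition, the second function
// (the combiner) is invoked and its printlns should show up.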
Why don't you try adding a third element to the input with one of the keys already in your data? Then, look out for printlns from both functions; a sketch of such an input follows.
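For instance, a hypothetical variant of the original input with an extra record for an existing key (the additional ("David", 8) tuple is purely illustrative):

// Hypothetical input: one more value for "David", spread over 2 partitions,
// so the combiner has to merge partial sets for that key.
val babyNamesCSV = spark.sparkContext.parallelize(
  List(("David", 6), ("Abby", 4), ("David", 5), ("Abby", 5), ("David", 8)), numSlices = 2)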
The reason may be that workers/executors which are not on the same machine/JVM as the driver cannot show their stdout in your driver program's console. Hope this helps.
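If that is what is happening, one option is to bring the aggregated result back to the driver and print it there, instead of relying on println inside the executor-side functions. A minimal sketch (assuming babyNamesCSV is the RDD from the question):

import scala.collection.mutable.HashSet

// Collect the aggregated result to the driver and print it locally,
// so the output is visible in the driver's (sbt) terminal.
val aggregated = babyNamesCSV
  .aggregateByKey(new HashSet[Int])(_ += _, _ ++ _)
  .collect()

aggregated.foreach(println)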