How to `reduce` only within partitions in Spark Streaming, perhaps using combineByKey?

I have data that is already partitioned by key across my Spark Streaming partitions by virtue of Kafka, i.e. keys found on one node are not found on any other node.

I would like to use redis and its incrby (increment by) command as a state engine. To reduce the number of requests sent to redis, I would like to partially reduce my data by doing a word count on each worker node by itself. (The key is tag+timestamp, so the word-count pattern gives me the functionality I need.) I would like to avoid shuffling and let redis take care of adding data across worker nodes.

Even though I have checked that the data is cleanly split among worker nodes, .reduce(_ + _) (Scala syntax) takes a long time (several seconds vs. sub-second for map tasks), as the HashPartitioner seems to shuffle my data to a random node to add it there.

How can I write a simple word-count reduce on each partition, without triggering the shuffling step, in Scala with Spark Streaming?

Note that DStream objects lack some RDD methods; those are available only through the transform method.
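For example (a rough sketch of my own; dstream stands for the incoming Kafka DStream), an RDD-only method such as mapPartitionsWithIndex has to go through transform:

// Rough sketch: reaching an RDD-only method from a DStream via transform.
// "dstream" stands for the incoming DStream[(String, String)] from Kafka.
val tagged = dstream.transform { rdd =>
  // mapPartitionsWithIndex exists on RDD but is not exposed on DStream
  rdd.mapPartitionsWithIndex { (idx, it) => it.map(record => (idx, record)) }
}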

It seems I might be able to use combineByKey. I would like to skip the mergeCombiners() step and instead leave the accumulated tuples where they are. The book "Learning Spark" enigmatically says:

We can disable map-side aggregation in combineByKey() if we know that our data won't benefit from it. For example, groupByKey() disables map-side aggregation as the aggregation function (appending to a list) does not save any space. If we want to disable map-side combines, we need to specify the partitioner; for now you can just use the partitioner on the source RDD by passing rdd.partitioner.

https://www.safaribooksonline.com/library/view/learning-spark/9781449359034/ch04.html

The book then supplies no syntax for how to do this, nor have I had any luck with Google so far.
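For reference, the overload the book seems to be alluding to is the one in PairRDDFunctions that takes an explicit Partitioner plus a mapSideCombine flag. A rough sketch of what a call would look like (kv stands for a hypothetical RDD[(String, Int)] obtained inside transform; this only shows the syntax and does not by itself avoid a shuffle):

import org.apache.spark.HashPartitioner

// Rough sketch only: "kv" stands for an RDD[(String, Int)] obtained inside transform.
// mapSideCombine = false turns off the per-partition pre-aggregation that
// combineByKey would otherwise perform before the shuffle.
val combined = kv.combineByKey(
  (v: Int) => v,                              // createCombiner
  (acc: Int, v: Int) => acc + v,              // mergeValue
  (a: Int, b: Int) => a + b,                  // mergeCombiners
  new HashPartitioner(kv.partitions.length),  // a DStream RDD's own partitioner is usually None
  mapSideCombine = false
)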

What is worse, as far as I know, the partitioner is not set for DStream RDDs in Spark Streaming, so I don't know how to supply a partitioner to combineByKey that doesn't end up shuffling data.

Also, what does "map-side" actually mean and what consequences does mapSideCombine = false have, exactly?

The Scala implementation of combineByKey can be found at https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala (look for combineByKeyWithClassTag).

If the solution involves a custom partitioner, please also include a code sample showing how to apply that partitioner to the incoming DStream.

This can be done using mapPartitions, which takes a function that maps an iterator over one partition of the input RDD to an iterator over the corresponding partition of the output RDD.

To implement a word count, I map to _._2 to drop the Kafka key and then perform a fast iterator word count using foldLeft, accumulating into a mutable.HashMap, which is then converted to an Iterator to form the output partition.

import scala.collection.mutable

val myDstream = messages
  .mapPartitions { it =>
    it.map(_._2)                                    // drop the Kafka key, keep the value
      .foldLeft(new mutable.HashMap[String, Int]) { // accumulate counts within this partition only
        (count, key) => count += (key -> (count.getOrElse(key, 0) + 1))
      }
      .toIterator                                   // emit the (word, count) pairs for this partition
  }
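From there, the per-partition counts can be pushed to redis with one incrby per key inside foreachRDD/foreachPartition, so the cross-node addition happens in redis rather than in a Spark shuffle. A rough sketch, assuming the Jedis client and placeholder connection details (a pooled client and pipelining would be better in practice):

import redis.clients.jedis.Jedis

// Rough sketch: placeholder host/port; create the connection inside foreachPartition
// so it lives on the executor, not the driver.
val redisHost = "localhost"
val redisPort = 6379

myDstream.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val jedis = new Jedis(redisHost, redisPort)
    try {
      partition.foreach { case (word, count) =>
        jedis.incrBy(word, count.toLong)   // redis adds up the partial counts from all workers
      }
    } finally {
      jedis.close()
    }
  }
}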
