
How do partitions work in Spark Streaming?

I am working on improving the performance of a Spark Streaming application.

How does partitioning work in a streaming environment? Is it the same as loading a file into Spark, or does it always create only one partition, making it run on only one core of the executor?

In Spark Streaming (as opposed to Structured Streaming), partitioning works exactly as you know it from working with RDDs. You can easily check the number of partitions with

rdd.getNumPartitions
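
For a DStream you can run the same check per micro-batch, for example inside foreachRDD. A minimal sketch, assuming stream stands in for whatever DStream your application consumes:

// Log the partition count of each micro-batch; `stream` is a
// placeholder for the DStream your application already consumes.
stream.foreachRDD { rdd =>
  println(s"Partitions in this batch: ${rdd.getNumPartitions}")
}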

As you have also tagged spark-streaming-kafka, it is worth mentioning that the number of partitions in your input DStream will match the number of partitions in the Kafka topic you are consuming.
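
For illustration, a minimal sketch of a direct Kafka stream using the spark-streaming-kafka-0-10 API; the broker address, group id, and topic name are placeholders, and ssc is assumed to be an existing StreamingContext. A topic created with, say, four partitions will produce batches whose RDDs also have four partitions:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

// Placeholder connection settings for illustration only.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "example-group"
)

// Each batch of this DStream has as many partitions as the topic.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc, // an existing StreamingContext
  PreferConsistent,
  Subscribe[String, String](Seq("my-topic"), kafkaParams)
)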

In general, for RDDs, the HashPartitioner and the RangePartitioner are available as repartitioning strategies. You could use the HashPartitioner like this:

rdd.partitionBy(new HashPartitioner(2))

where rdd is a key-value paired RDD and 2 is the number of partitions.
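
Put together, a small self-contained sketch (sc is assumed to be an existing SparkContext; the data is made up):

import org.apache.spark.HashPartitioner

// Toy key-value RDD; keys are hashed into 2 partitions.
val pairs = sc.parallelize(Seq(("foo", 1), ("bar", 2), ("baz", 3)))
val repartitioned = pairs.partitionBy(new HashPartitioner(2))
println(repartitioned.getNumPartitions) // prints 2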

Compared to the Structured API, RDDs also have the advantage of supporting custom partitioners. For this, you can extend the Partitioner class and override the methods numPartitions and getPartition, as in the example given below:

import org.apache.spark.Partitioner

// Routes all records for table "foo" to partition 0 and
// everything else to partition 1.
class TablePartitioner extends Partitioner {
  override def numPartitions: Int = 2
  override def getPartition(key: Any): Int = {
    val tableName = key.asInstanceOf[String]
    if (tableName == "foo") 0 // partition indices start at 0
    else 1
  }
}
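
You would then apply it like any other partitioner; a hypothetical example with made-up data keyed by table name:

// Hypothetical RDD keyed by table name.
val rows = sc.parallelize(Seq(("foo", "a"), ("bar", "b"), ("foo", "c")))

// All "foo" records land in partition 0, everything else in partition 1.
val byTable = rows.partitionBy(new TablePartitioner)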
