Is it possible to process or trigger a method/action only once per batch of records in Spark Streaming?
My use case is to call loadConfiguration() once per DStream batch, whether the batch contains 1 or n records. The loaded config should be available at the driver for further processing.
Ex:
batch-1: 0 records in the Kinesis stream - loadConfiguration() not triggered
batch-2: 1 record in the Kinesis stream - loadConfiguration() called once and variables updated at the driver level
batch-3: 100 records in the Kinesis stream - loadConfiguration() called once and variables updated at the driver level
Thanks in Advance.
I'm not quite sure whether I have understood the exact requirement. However, based on the question description and your explanation in the comments, this is something that might work:
dstream.foreachRDD { rdd =>
  val config = loadConfiguration() // executed at the driver, once per batch
  rdd.foreach { record =>
    // do stuff here, e.g. config.get(...). This code is executed at the workers.
  }
}
An important thing to note here is that the Config class has to be serializable, as it will be sent from the driver to the workers.
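As a minimal sketch of what that requirement means, the snippet below shows a hypothetical Config class (the name and fields are illustrative, not from the question) that Spark can ship to executors:

```scala
// Hypothetical Config -- a case class is serializable by default,
// so instances created at the driver can be shipped inside the closure.
case class Config(settings: Map[String, String]) {
  def get(key: String): Option[String] = settings.get(key)
}

// Illustrative loader; in practice this would read from a file, DB, etc.
def loadConfiguration(): Config =
  Config(Map("feature.enabled" -> "true"))
```

Any non-serializable members (e.g. open connections) would need to be marked @transient or created on the executor side instead.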
Also, note that this could be an anti-pattern depending on your use case: for each batch, the config object will be serialized and sent to the workers, which adds network overhead proportional to the size of the config object.
I would strongly recommend checking the recommended design patterns for the foreachRDD construct and choosing your approach wisely. Here is the link: https://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
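If the per-batch serialization overhead becomes a problem, the linked guide suggests a lazily initialized, executor-local instance instead. Below is a hedged sketch of that pattern, assuming a hypothetical ConfigHolder object; note the trade-off that the config is then loaded once per executor JVM rather than refreshed every batch:

```scala
// Sketch of the lazy-singleton pattern from the design patterns guide.
// ConfigHolder is a hypothetical name; loadConfiguration() is assumed
// to be callable from the executors as well as the driver.
object ConfigHolder {
  lazy val config: Config = loadConfiguration() // initialized once per executor JVM
}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val config = ConfigHolder.config // reuses the executor-local instance
    records.foreach { record =>
      // process each record using config.get(...)
    }
  }
}
```

This avoids serializing the config with every batch, at the cost of per-batch freshness; if the config must be reloaded each batch at the driver, the original foreachRDD approach is the simpler fit.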