
Spark Streaming - Kinesis - Java

Is it possible to process or trigger a method/action only once per batch of records in Spark Streaming?

My use case is to call loadConfigurations() once per DStream batch, whether the batch contains 1 or n records. The loaded config should be available at the driver for further processing.

Ex:

batch-1: 0 records in kinesis stream - no trigger of loadConfiguration()

batch-2: 1 record in kinesis stream - loadConfiguration() called once and variables updated at driver level

batch-3: 100 records in kinesis stream - loadConfiguration() called once and variables updated at driver level

Thanks in Advance.

Not quite sure whether I have understood the exact requirement. However, based on the question description and your explanation in comments, this is something which might work:

dstream.foreachRDD { rdd =>
  val config = loadConfiguration() // executed at the driver
  rdd.foreach { record =>
    // do stuff here, e.g. config.get(...). This code is executed at the workers.
  }
}
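Since your batch-1 case should not trigger loadConfiguration() at all, the driver can guard on RDD.isEmpty() before loading. A minimal sketch, where loadConfiguration() is a hypothetical stand-in returning a serializable Map:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical config loader; returns a serializable Map in this sketch.
def loadConfiguration(): Map[String, String] =
  Map("endpoint" -> "example", "batchSize" -> "100")

def processBatches(dstream: DStream[String]): Unit = {
  dstream.foreachRDD { rdd =>
    // isEmpty() is cheaper than count(): it only checks whether one element exists.
    if (!rdd.isEmpty()) {
      val config = loadConfiguration() // once per non-empty batch, on the driver
      rdd.foreach { record =>
        // executed on the workers; `config` is serialized into this closure
        println(s"$record -> ${config.getOrElse("endpoint", "n/a")}")
      }
    }
  }
}
```

With this guard, a batch with 0 records skips loadConfiguration() entirely, while batches with 1 or 100 records call it exactly once.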

An important thing to note here is that the Config class has to be serializable, as it will be sent from the driver to the workers.
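The serializability requirement can be checked outside Spark, since closure shipping uses the same java.io machinery: a non-serializable field would throw NotSerializableException at this point. A standalone sketch (the Config class and its fields are illustrative):

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// A config that is safe to ship to workers: the class and all its fields serialize.
case class Config(settings: Map[String, String]) extends Serializable

// Serialize and deserialize, mimicking what happens when a closure is sent to a worker.
def roundTrip(value: Config): Config = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(value)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[Config]
}

val copy = roundTrip(Config(Map("region" -> "us-east-1")))
```

If Config held, say, an open socket or a non-serializable client object, the writeObject call above would fail, which is exactly the Task not serializable error Spark reports.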

Also, note that this could be an anti-pattern depending on your use case: for each batch, the config object will be serialized and sent to the workers, which adds network overhead proportional to the size of the config object.

I would strongly recommend checking the recommended design patterns for the foreachRDD construct and choosing your approach wisely. Here is the link: https://spark.apache.org/docs/latest/streaming-programming-guide.html#design-patterns-for-using-foreachrdd
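One of the patterns that guide describes is creating expensive resources once per partition rather than once per record, via foreachPartition. A sketch under that pattern, where ConnectionPool and Connection are hypothetical stand-ins for a real pooling library:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical stand-ins for a real connection/pooling library.
class Connection { def send(record: String): Unit = println(record) }
object ConnectionPool {
  def getConnection(): Connection = new Connection
  def returnConnection(c: Connection): Unit = () // returned for reuse across batches
}

def writeBatches(dstream: DStream[String]): Unit =
  dstream.foreachRDD { rdd =>
    rdd.foreachPartition { records =>
      // One connection per partition, instead of one per record.
      val connection = ConnectionPool.getConnection()
      records.foreach(connection.send)
      ConnectionPool.returnConnection(connection)
    }
  }
```

The key point is that the connection is created inside foreachPartition, so it is instantiated on the worker and never has to be serialized from the driver.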
