
Should I use stateful computation? Spark Streaming stateful computation explained

Here is my case: I receive data from different devices, each of which has its own signature, a timestamp, and a flag. I then save the records with (flag==SAVE_VALUE) to a file using a foreachRDD function, but only if they pass this condition:

(it is the first time I receive this signature)
OR
(I already have this signature && the timestamp is older than an hour)

While I was working in a local environment, this meant using a Map where I stored every ID and the last timestamp received. Now I would like to move this logic into a Spark-style one. How should I do it?
I feel this is a case for a stateful DStream, but I cannot completely understand:

  • How should I store a map-like RDD in a DStream? Or how do I create a single "map RDD"?
  • How do I compare it against the newly arriving data?

Have a look at mapWithState(); it is exactly what you want.

In the StateSpec function, you can decide whether to update, keep, or remove the current state whenever a new value arrives for the same key. You have access to both the current state and the new value, so you can do any kind of comparison between the two.
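
As an example, here is a minimal sketch of such a mapping function in Scala. The record shape (signature as key, (timestampMillis, flag) as value), the state type (last saved timestamp per signature), and the one-hour constant are assumptions based on the question, not part of the mapWithState API:

```scala
import org.apache.spark.streaming.{State, StateSpec, Time}

// Assumed value shape: (timestampMillis, flag); state per signature: last saved timestamp.
val oneHourMs = 60 * 60 * 1000L

val spec = StateSpec.function(
  (batchTime: Time, signature: String, value: Option[(Long, Int)], state: State[Long]) => {
    value.flatMap { case (timestamp, flag) =>
      val firstTime     = !state.exists()
      val olderThanHour = state.exists() && timestamp - state.get() > oneHourMs
      if (firstTime || olderThanHour) {
        state.update(timestamp)              // remember when this signature was last saved
        Some((signature, timestamp, flag))   // emit: this record should be written out
      } else {
        None                                 // seen less than an hour ago: drop it
      }
    }
  }
)
```

Returning Some(...) emits the record into the resulting DStream for that batch, while None drops it; the state survives across batches either way.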

It also has built-in support for timeouts, and the state can be partitioned across multiple executors.
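
For instance, building on the spec above (the 90-minute idle timeout and the partition count are just illustrative values):

```scala
import org.apache.spark.streaming.Minutes

// Drop state entries for signatures that have been idle longer than the timeout,
// and spread the state map over a fixed number of partitions.
val tunedSpec = spec
  .timeout(Minutes(90))
  .numPartitions(16)
```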

You can access the global map by calling stateSnapshots() on the return value of mapWithState(). Otherwise, the returned DStream contains, per batch, whatever your StateSpec function returned.
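
Wired together, that might look roughly like this. Here deviceStream and SAVE_VALUE are placeholders for the input DStream and the flag constant from the question, and the output path is only an example:

```scala
import org.apache.spark.streaming.dstream.DStream

val SAVE_VALUE = 1                              // placeholder for your flag constant

// deviceStream: hypothetical DStream[(String, Long, Int)] of (signature, timestamp, flag)
def saveFiltered(deviceStream: DStream[(String, Long, Int)]): Unit = {
  val keyed = deviceStream
    .filter(_._3 == SAVE_VALUE)                            // only records flagged for saving
    .map { case (sig, ts, flag) => (sig, (ts, flag)) }     // key by signature

  val toSave = keyed.mapWithState(tunedSpec)               // per-batch: records that passed the check

  toSave.foreachRDD { rdd =>
    // same foreachRDD persistence as before, now only for "new or older than an hour" records
    rdd.saveAsTextFile(s"/tmp/readings-${System.currentTimeMillis()}")
  }

  // Global signature -> last-saved-timestamp map, emitted once per batch
  val lastSaved = toSave.stateSnapshots()                  // DStream[(String, Long)]
  lastSaved.print()
}
```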

mapWithState() was added in Spark 1.6; before that there was a similar function called updateStateByKey(), which did mostly the same thing but performed worse on larger datasets.
