
Can we use Spark Streaming for time-based events?

I have a requirement as follows:

  1. There are multiple devices producing data based on their device configuration. For example, two devices produce data at their own intervals: say d1 produces a value every 15 minutes and d2 every 30 minutes.
  2. All this data will be sent to Kafka
  3. I need to consume the data and perform a calculation per device, based on the values produced in the current hour plus the first value produced in the next hour. For example, if d1 produces data every 15 minutes from 12:00 AM-1:00 AM, the calculation uses the values from that hour and the first value produced between 1:00 AM-2:00 AM. If no value is produced between 1:00 AM-2:00 AM, I need to fall back to only the data from 12:00 AM-1:00 AM, and then save the result to a time-series data repository (see the sketch after this list).
  4. There will be 'n' such devices, each with its own configuration. In the scenario above, devices d1 and d2 both work on a 1-hour cycle; other devices might produce on a 3-hour or 6-hour cycle.
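
To make the rule in step 3 concrete, here is a minimal plain-Java sketch. The `Reading` record and all names in it are hypothetical illustrations, not part of the actual system; it assumes the readings for one device arrive sorted by time:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reading type; the field names are illustrative, not from the question.
record Reading(String deviceId, Instant timestamp, double value) {}

class HourlyInputs {
    /**
     * Returns the readings to feed the calculation for the hour starting at
     * hourStart: everything inside that hour, plus the first reading of the
     * next hour if the device produced one, otherwise the current hour alone.
     */
    static List<Reading> forHour(List<Reading> sortedByTime, Instant hourStart) {
        Instant hourEnd = hourStart.plus(1, ChronoUnit.HOURS);
        Instant nextHourEnd = hourEnd.plus(1, ChronoUnit.HOURS);

        List<Reading> inputs = new ArrayList<>();
        for (Reading r : sortedByTime) {
            if (!r.timestamp().isBefore(hourStart) && r.timestamp().isBefore(hourEnd)) {
                inputs.add(r);   // value from the current hour
            } else if (!r.timestamp().isBefore(hourEnd) && r.timestamp().isBefore(nextHourEnd)) {
                inputs.add(r);   // first value of the next hour
                break;           // only the first one counts
            }
        }
        return inputs;           // falls back to the current hour if the next hour is empty
    }
}
```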

Currently this requirement is implemented in Java. Since the number of devices is growing, and the computations with it, I would like to know whether Spark/Spark Streaming can be applied to this scenario. Any articles covering this kind of requirement would be of great help.

If, and this is a big if, the computations are going to be device-wise, you can make use of topic partitions and scale the number of partitions with the number of devices. Messages are delivered in order within a partition; this is the most powerful idea you need to understand.
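
As a minimal sketch of that idea (the topic name, broker address, and payload format below are assumptions for illustration): keying every record by its device id makes Kafka's default partitioner route all of a device's messages to the same partition, where ordering is guaranteed.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DeviceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The default partitioner hashes the key, so every record keyed "d1"
            // lands on the same partition and stays in order.
            producer.send(new ProducerRecord<>("device-readings", "d1",
                    "2020-01-01T00:15:00Z,42.0")); // illustrative payload
        }
    }
}
```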

However, some words of caution:

  • The number of partitions can only ever grow; if you want to shrink it, you may need to purge the topic and start again.
  • To ensure that devices are uniformly distributed across partitions, consider assigning a GUID to each device and using it as the message key.
  • If the calculations do not involve machine-learning libraries and can be done in plain Java, it may be a good idea to use plain old consumers (or Kafka Streams) instead of abstracting them via Spark Streaming; the lower the level, the greater the flexibility (see the sketch after this list).
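
For the Kafka Streams route, a minimal sketch under the same assumptions (topic `device-readings`, records keyed by device id, numeric string payloads) might look like the following. The aggregation body is a placeholder; note also that a plain tumbling window does not by itself capture the "first value of the next hour" rule, which would need custom processor/state-store logic, and that is exactly where the extra flexibility of a lower-level API helps.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;

public class DeviceHourlyApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("device-readings") // assumed topic, keyed by device id
               // One-hour tumbling windows per device; the grace period leaves
               // room for late-arriving data from the start of the next hour.
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofHours(1), Duration.ofHours(1)))
               .aggregate(() -> 0.0,
                          (deviceId, value, agg) -> agg + Double.parseDouble(value), // placeholder calculation
                          Materialized.with(Serdes.String(), Serdes.Double()))
               .toStream()
               .foreach((windowedDeviceId, result) -> {
                   // Write the hourly result to the time-series repository here.
               });

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "device-hourly-calc"); // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // illustrative address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```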

You can check this article on choosing the number of topics/partitions: https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster
