
Can we use Spark Streaming for time-based events?

I have a requirement as follows:

  1. There are multiple devices producing data based on their device configuration. For example, two devices produce data at their own intervals: say d1 produces a value every 15 minutes and d2 every 30 minutes.
  2. All this data will be sent to Kafka
  3. I need to consume the data and perform a calculation per device, based on the values produced in the current hour plus the first value produced in the next hour. For example, if d1 produces data every 15 minutes from 12:00 AM-1:00 AM, the calculation uses the values from that hour and the first value produced between 1:00 AM-2:00 AM. If no value is produced between 1:00 AM-2:00 AM, I need to fall back to only the data from 12:00 AM-1:00 AM, and then save the result to a time-series data repository (see the sketch after this list).
  4. There will be 'n' such devices, each with its own configuration. In the scenario above, devices d1 and d2 both work on a 1-hour cycle; other devices might produce on a 3-hour or 6-hour cycle.
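
To make the rule in step 3 concrete, here is a minimal plain-Java sketch. The `Reading` record and all names in it are hypothetical illustrations, not part of the actual system; it assumes the readings for one device arrive sorted by time:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;
import java.util.ArrayList;
import java.util.List;

// Hypothetical reading type; the field names are illustrative, not from the question.
record Reading(String deviceId, Instant timestamp, double value) {}

class HourlyInputs {
    /**
     * Returns the readings to feed the calculation for the hour starting at
     * hourStart: everything inside that hour, plus the first reading of the
     * next hour if the device produced one, otherwise the current hour alone.
     */
    static List<Reading> forHour(List<Reading> sortedByTime, Instant hourStart) {
        Instant hourEnd = hourStart.plus(1, ChronoUnit.HOURS);
        Instant nextHourEnd = hourEnd.plus(1, ChronoUnit.HOURS);

        List<Reading> inputs = new ArrayList<>();
        for (Reading r : sortedByTime) {
            if (!r.timestamp().isBefore(hourStart) && r.timestamp().isBefore(hourEnd)) {
                inputs.add(r);   // value from the current hour
            } else if (!r.timestamp().isBefore(hourEnd) && r.timestamp().isBefore(nextHourEnd)) {
                inputs.add(r);   // first value of the next hour
                break;           // only the first one counts
            }
        }
        return inputs;           // falls back to the current hour if the next hour is empty
    }
}
```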

Currently this requirement is implemented in Java. Since the number of devices is growing, and the computations with it, I would like to know whether Spark/Spark Streaming can be applied to this scenario. Any articles covering this kind of requirement would be of great help.

If, and this is a big if, the computations are going to be device-wise, you can make use of topic partitions and scale the number of partitions with the number of devices. Messages are delivered in order within a partition; this is the most powerful idea you need to understand.
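
As a minimal sketch of that idea (the topic name, broker address, and payload format below are assumptions for illustration): keying every record by its device id makes Kafka's default partitioner route all of a device's messages to the same partition, where ordering is guaranteed.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DeviceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // illustrative address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The default partitioner hashes the key, so every record keyed "d1"
            // lands on the same partition and stays in order.
            producer.send(new ProducerRecord<>("device-readings", "d1",
                    "2020-01-01T00:15:00Z,42.0")); // illustrative payload
        }
    }
}
```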

However, some words of caution:

  • The number of partitions can only ever grow; if you want to shrink it, you may need to purge the topic and start again.
  • To ensure that devices are uniformly distributed across partitions, consider assigning a GUID to each device and using it as the message key.
  • If the calculations do not involve machine-learning libraries and can be done in plain Java, it may be a good idea to use plain old consumers (or Kafka Streams) instead of abstracting them via Spark Streaming; the lower the level, the greater the flexibility (see the sketch after this list).
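
For the Kafka Streams route, a minimal sketch under the same assumptions (topic `device-readings`, records keyed by device id, numeric string payloads) might look like the following. The aggregation body is a placeholder; note also that a plain tumbling window does not by itself capture the "first value of the next hour" rule, which would need custom processor/state-store logic, and that is exactly where the extra flexibility of a lower-level API helps.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.TimeWindows;

public class DeviceHourlyApp {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();
        builder.<String, String>stream("device-readings") // assumed topic, keyed by device id
               // One-hour tumbling windows per device; the grace period leaves
               // room for late-arriving data from the start of the next hour.
               .groupByKey()
               .windowedBy(TimeWindows.ofSizeAndGrace(Duration.ofHours(1), Duration.ofHours(1)))
               .aggregate(() -> 0.0,
                          (deviceId, value, agg) -> agg + Double.parseDouble(value), // placeholder calculation
                          Materialized.with(Serdes.String(), Serdes.Double()))
               .toStream()
               .foreach((windowedDeviceId, result) -> {
                   // Write the hourly result to the time-series repository here.
               });

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "device-hourly-calc"); // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // illustrative address
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}
```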

You can check this article on choosing the number of topics/partitions: https://www.confluent.io/blog/how-choose-number-topics-partitions-kafka-cluster
