
How big can batches in Flink or Spark get?

Original post: 2020-01-16 09:55:10 | Tags: apache-spark, bigdata, apache-flink

I am currently working on a framework for analysis applications for a large-scale experiment. The experiment contains about 40 instruments, each generating about 1 GB/s with nanosecond timestamps. The data is intended to be analysed in time chunks.
For the implementation, I would like to know how big such a "chunk" (aka batch) can get before Flink or Spark stop processing the data. I think it goes without saying that I intend to recollect the processed data.

1 answer

For live data analysis

In general, there is no hard limit on how much data you can process with these systems. It all depends on how many nodes you have and what kind of query you run.

Since it sounds like you mainly want to aggregate per instrument over a given time window, your maximum scale-out is limited to 40. That's the maximum number of machines that you could throw at your problem. Then the question becomes how big your time chunks are and how complex the aggregations become. Assuming that your aggregation requires all data of a window to be present, the system needs to hold 1 GB per second for each instrument. So if your window is one hour, the system needs to hold at least 3.6 TB of data per instrument, or about 144 TB across all 40 instruments.
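The arithmetic behind these numbers can be sketched in a few lines. The figures (40 instruments, roughly 1 GB/s each, a one-hour window) come from the question. The function name `window_volume_bytes` is only illustrative, and 1 GB is assumed to mean 10^9 bytes.

```python
GB = 10**9   # assumption: decimal gigabytes
TB = 10**12  # assumption: decimal terabytes


def window_volume_bytes(rate_bytes_per_s, window_s, instruments=1):
    """Bytes that must be held if a whole window has to be buffered."""
    return rate_bytes_per_s * window_s * instruments


# One-hour window at about 1 GB/s per instrument (figures from the question).
per_instrument = window_volume_bytes(1 * GB, 3600)
total = window_volume_bytes(1 * GB, 3600, instruments=40)

print(per_instrument / TB, "TB per instrument")  # 3.6
print(total / TB, "TB in total")                 # 144.0
```

This only counts raw payload. Any per-record overhead added by the framework comes on top and is not modelled here.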

If the main memory of the machines is not sufficient, data needs to be spilled to disk, which slows down processing significantly. Spark really likes to keep all data in memory, so memory size would be its practical limit. Flink can spill almost all data to disk, but then disk I/O becomes the bottleneck.
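Turning the question around, a rough estimate of how long a window can be before a node has to spill might look like the sketch below. The node size (256 GB of RAM) and the usable fraction (half) are made-up values for illustration, not measurements or recommendations, and `max_in_memory_window_s` is a hypothetical helper name.

```python
GB = 10**9  # assumption: decimal gigabytes


def max_in_memory_window_s(usable_memory_bytes, rate_bytes_per_s):
    """Longest window in seconds whose raw data still fits in memory."""
    return usable_memory_bytes / rate_bytes_per_s


# Hypothetical node: 256 GB RAM, of which half is usable for window data.
usable = 0.5 * 256 * GB

# One instrument per node at about 1 GB/s (rate taken from the question).
print(max_in_memory_window_s(usable, 1 * GB), "seconds")  # 128.0
```

Under these assumptions, a node buffering a full window for one instrument would run out of memory after roughly two minutes, which is why the window length and the type of aggregation matter so much.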

If you only need to calculate small values (like sums or averages), main memory shouldn't become an issue.
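The reason small aggregates are cheap is that they can be updated record by record, so only a tiny accumulator has to stay in memory instead of the whole window. A minimal, framework-independent sketch of that idea follows. `RunningStats` is a hypothetical class written for this example, not an API of Spark or Flink.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-size accumulator for the sum and average of a window."""
    count: int = 0
    total: float = 0.0

    def add(self, value):
        # Fold one record into the accumulator; the record can then be dropped.
        self.count += 1
        self.total += value

    def merge(self, other):
        # Combine partial results, e.g. from two parallel workers.
        return RunningStats(self.count + other.count, self.total + other.total)

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


stats = RunningStats()
for sample in [1.0, 2.0, 3.0, 4.0]:
    stats.add(sample)

print(stats.total, stats.mean)  # 10.0 2.5
```

However many samples arrive in a window, the state stays at two numbers per instrument, so the window length no longer drives memory use.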

For old data analysis

When analyzing old data, the system can do batch processing and has many more options to handle the volume, including spilling to local disk. Spark usually shines if you can keep all data of one window in main memory. If you are not certain about that, or you know it will not fit into main memory, Flink is the more scalable solution. Nevertheless, I'd expect both frameworks to work well for your use case.

I'd rather look at the ecosystem and how well it suits you. Which languages do you want to use? It feels like using Jupyter notebooks or Zeppelin would work best for your rather ad-hoc analysis and data exploration. Especially if you want to use Python, I'd probably give Spark a try first.


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.


solution1 1 2020-01-16 10:21:35
