
Spark Streaming: handling skewed Kafka partitions

Scenario:
Kafka -> Spark Streaming

Logic in each Spark Streaming micro-batch (30 seconds):
Read JSON -> Parse JSON -> Send to Kafka

My streaming job reads from around 1,000 Kafka topics with around 10K Kafka partitions, at a throughput of roughly 5 million events/s.
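For reference, here is a minimal sketch of what such a job might look like (a sketch only: the broker address, topic pattern, output topic, group id, and class/variable names are assumptions, and the "Parse JSON" step is elided):

import java.util.regex.Pattern
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}

object JsonRelayJob {
  def main(args: Array[String]): Unit = {
    // 30-second micro-batches, as in the question.
    val ssc = new StreamingContext(new SparkConf().setAppName("json-relay"), Seconds(30))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092",            // assumed broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "json-relay")

    // KafkaUtils creates one Spark partition per Kafka partition (~10K here).
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.SubscribePattern[String, String](Pattern.compile("events\\..*"), kafkaParams))

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        // One producer per task; the real job parses/transforms the JSON before sending.
        val props = new java.util.Properties()
        props.put("bootstrap.servers", "broker1:9092")
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
        val producer = new KafkaProducer[String, String](props)
        records.foreach(r => producer.send(new ProducerRecord[String, String]("out-topic", r.value())))
        producer.close()
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}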

The issue comes from the uneven traffic load across Kafka partitions: some partitions have about 50 times the throughput of the smaller ones. Because KafkaUtils creates a 1:1 mapping from Kafka partitions to Spark partitions, this produces skewed RDD partitions and really hurts overall performance: in each micro-batch, most executors end up waiting for the one with the largest load to finish. I can see this in the Spark UI; at some point in every micro-batch only a few executors still have "ACTIVE" tasks while all the others have finished theirs and are waiting, and the task-time distribution shows a MAX of 2.5 minutes but a MEDIAN of only 20 seconds.

Notes:

  1. This is Spark Streaming (DStreams), not Structured Streaming.
  2. I am aware of the post Spark - repartition() vs coalesce(); I am not asking about the difference between repartition() and coalesce(). The load is consistent, so this is not about autoscaling or dynamic allocation either.

What I tried:

  1. coalesce() helps a little but does not remove the skew, sometimes even makes it worse, and comes with a higher risk of OOM on the executors.
  2. repartition() does remove the skew, but a full shuffle is simply too expensive at this scale; the penalty is not paid back in execution time per batch. Increasing the batch duration does not help either, because a longer batch means more load per micro-batch and therefore more data to shuffle. (See the sketch after this list for where both were applied.)
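Concretely, using the stream variable from the sketch earlier and an assumed target of 2000 partitions (not a value from the question), the two attempts plug in roughly like this:

// Extract the payload first; ConsumerRecord itself is not serializable for a shuffle.
val values = stream.map(_.value())

// Attempt 1: coalesce() merges partitions without a shuffle, so a heavy
// Kafka partition is never split up and the skew largely remains.
val coalesced = values.transform(rdd => rdd.coalesce(2000))

// Attempt 2: repartition() does a full shuffle that evens out partition sizes,
// but at ~5M events/s the shuffle cost outweighs the per-batch savings.
val repartitioned = values.transform(rdd => rdd.repartition(2000))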

How do I make the workload more evenly distributed across the Spark executors so that resources are used more efficiently and performance improves?

I have the same issue. You can try the minPartitions parameter from Spark 2.4.7.

A few things are important to highlight:

  • By default, one Kafka partition is mapped to one Spark partition, or several Spark partitions can map to one Kafka partition.
  • The Kafka DataFrame has start and end offset boundaries per partition.
  • maxOffsetsPerTrigger defines the number of messages read from Kafka per trigger.
  • Spark 2.4.7 also supports the minPartitions parameter, which can map one Kafka partition to multiple Spark partitions based on its offset range. By default it makes a best effort to split each Kafka partition (offset range) evenly.

So using minPartitions and maxOffsetsPerTrigger you can pre-calculate a sensible number of partitions:

.option("minPartitions", partitionsNumberLoadedFromKafkaAdminAPI * splitPartitionFactor)
.option("maxOffsetsPerTrigger", maxEventsPerPartition * partitionsNumber)

maxEventsPerPartition and splitPartitionFactor are defined in config.
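Note that minPartitions and maxOffsetsPerTrigger belong to the Structured Streaming Kafka source. For a fuller picture, here is a minimal sketch around those two options; the broker address, topic prefix/pattern, and the concrete values of splitPartitionFactor and maxEventsPerPartition are assumptions, and the partition count is looked up with the standard Kafka AdminClient:

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.AdminClient
import org.apache.spark.sql.SparkSession

object SkewedKafkaRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("skewed-kafka-read").getOrCreate()

    // Hypothetical tuning knobs, normally loaded from config.
    val splitPartitionFactor = 4         // target Spark partitions per Kafka partition
    val maxEventsPerPartition = 100000L  // rough cap on records per Kafka partition per trigger

    // Count the Kafka partitions behind the subscription via the Kafka Admin API.
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // assumed broker address
    val admin = AdminClient.create(props)
    val topics = admin.listTopics().names().get().asScala.filter(_.startsWith("events."))
    val partitionsNumber = admin.describeTopics(topics.toList.asJava).all().get()
      .asScala.values.map(_.partitions().size()).sum
    admin.close()

    // minPartitions asks the Kafka source for at least this many Spark partitions
    // (large offset ranges get split); maxOffsetsPerTrigger bounds each micro-batch.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")
      .option("subscribePattern", "events\\..*")
      .option("minPartitions", (partitionsNumber * splitPartitionFactor).toString)
      .option("maxOffsetsPerTrigger", (maxEventsPerPartition * partitionsNumber).toString)
      .load()

    // Parsing and writing back to Kafka would follow here.
    df.writeStream.format("console").start().awaitTermination()
  }
}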

In my case, I sometimes get data spikes and my message sizes can vary a lot, so I implemented my own streaming Source that can split Kafka partitions by exact record size and even coalesce a few Kafka partitions into one Spark partition.

Actually, you have provided your own answer.

Do not have one streaming job reading from 1,000 topics. Put the topics with the biggest load into separate streaming job(s). Reconfigure; it is that simple. Load balancing, queuing theory.
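One way to read that advice, as a sketch (the topic names and args-based configuration are assumptions, and ssc/kafkaParams refer to the earlier sketch): run the same job several times, each instance subscribed to a different slice of the topics, so the heavy topics no longer share executors with everything else.

// Assumed invocations:
//   spark-submit ... app.jar "hot-topic-1,hot-topic-2"            (heavy topics)
//   spark-submit ... app.jar "light-topic-1,light-topic-2,..."    (the rest)
val topics = args(0).split(",").map(_.trim).toSet
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams))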

Stragglers are an issue in Spark, although a straggler takes on a slightly different character in Spark.
