
Kafka topic partitions to Spark streaming

I have some use cases I would like clarified, regarding how Kafka topic partitioning maps to Spark Streaming resource utilization.

I use Spark standalone mode, so the only settings I have are "total number of executors" and "executor memory". As far as I know, and according to the documentation, the way to introduce parallelism into Spark Streaming is to use a partitioned Kafka topic: when I use the spark-kafka direct stream integration, the RDD will have the same number of partitions as the Kafka topic.
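
To make that concrete, here is a minimal sketch of the direct stream setup, assuming the spark-streaming-kafka-0-10 integration; the broker address, group id, topic name, and batch interval are placeholders:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val conf = new SparkConf().setAppName("KafkaPartitionDemo")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "localhost:9092",           // placeholder broker
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "demo-group",               // placeholder group id
      "auto.offset.reset"  -> "latest"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Array("my-topic"), kafkaParams))

    // The direct stream gives a 1:1 mapping between Kafka partitions and
    // RDD partitions, which can be verified per batch:
    stream.foreachRDD { rdd => println(s"partitions in this batch: ${rdd.getNumPartitions}") }

    ssc.start()
    ssc.awaitTermination()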

So if I have 1 partition in the topic and 1 executor core, that core will read from Kafka sequentially.

What happens if I have:

  • 2 partitions in the topic and only 1 executor core? Will that core read first from one partition and then from the second, so there is no benefit in partitioning the topic?

  • 2 partitions in the topic and 2 cores? Will one executor core then read from the first partition, and the second core from the second partition?

  • 1 Kafka partition and 2 executor cores?

Thank you.

The basic rule is that you can scale up to the number of Kafka partitions. If you set spark.executor.cores greater than the number of partitions, some of the threads will be idle. If it's less than the number of partitions, Spark will have threads read from one partition and then the other (there is a rough sizing sketch after the list below). So:

  1. 2 partitions, 1 executor: reads from one partition, then the other. (I am not sure how Spark decides how much to read from each before switching.)

  2. 2p, 2c: parallel execution

  3. 1p, 2c: one thread is idle
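
As a rough sizing sketch for standalone mode (the master URL, memory value, class, and jar name are placeholders): matching the total core count to the topic's partition count, e.g. 2 cores for a 2-partition topic, keeps every thread busy without leaving any idle.

    spark-submit \
      --master spark://master-host:7077 \
      --total-executor-cores 2 \
      --executor-memory 2g \
      --class com.example.StreamingApp \
      streaming-app.jar

In standalone mode, --total-executor-cores caps the cores across all executors, which is what cases #1 to #3 above are counting.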

For case #1, note that having more partitions than executors is fine, since it lets you scale out later without having to re-partition. The trick is to make sure that the number of partitions is evenly divisible by the number of executors. Spark has to process all the partitions before passing data on to the next step in the pipeline, so 'remainder' partitions can slow down processing. For example, with 5 partitions and 4 threads, processing takes the time of 2 full partitions: 4 run at once, then one thread runs the 5th partition by itself.
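
The 'remainder' effect is just ceiling division; here is a tiny illustrative helper (the function name is made up):

    // Sequential task "waves" a stage needs: ceil(partitions / cores).
    def waves(numPartitions: Int, totalCores: Int): Int =
      (numPartitions + totalCores - 1) / totalCores

    waves(4, 4)  // 1 wave: all four partitions processed at once
    waves(5, 4)  // 2 waves: four in parallel, then the fifth alone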

Also note that you may see better processing throughput if you keep the number of partitions/RDDs the same throughout the pipeline, by explicitly setting the number of data partitions in functions like reduceByKey().
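
Continuing the stream sketch from above, and assuming a 4-partition topic, that might look like this (the names are illustrative, not from the question):

    // Keep downstream RDDs at the same parallelism as the Kafka input
    // by passing an explicit partition count to the shuffle.
    val counts = stream
      .map(record => (record.value, 1L))
      .reduceByKey(_ + _, 4) // 4 = the topic's partition count

This avoids Spark falling back to a default parallelism that does not match the number of Kafka partitions.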
