
Spark continuous processing mode does not read all Kafka topic partitions

I'm experimenting with Spark's Continuous Processing mode in Structured Streaming. I'm reading from a Kafka topic with 2 partitions, while the Spark application has only one executor with one core.

The application is a simple one: it just reads from the first topic and publishes to the second one. The problem is that my console consumer, which reads from the second topic, only sees messages from one partition of the first topic. This means my Spark application only reads messages from one partition of the topic.
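For reference, here is a minimal sketch of that setup. The topic names, broker address, trigger interval, and checkpoint path are placeholders I've assumed, not details from the question:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ContinuousCopy {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("continuous-copy")
      .getOrCreate()

    // Read from the first topic (2 partitions).
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topic-in")
      .load()

    // Publish the records unchanged to the second topic.
    val query = input
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "topic-out")
      .option("checkpointLocation", "/tmp/continuous-copy-checkpoint")
      .trigger(Trigger.Continuous("1 second")) // continuous processing mode
      .start()

    query.awaitTermination()
  }
}
```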

How can I make my Spark application read from both partitions of the topic?

Note

I'm asking this question for people who might run into the same issue as I did.

I found the answer to my question in the caveats section of the Spark Structured Streaming documentation.

Basically, in continuous processing mode Spark launches long-running tasks, each of which reads from one partition of the topic. Since only one task can run per core, the Spark application needs as many cores as there are Kafka topic partitions it reads from.
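So with the 2-partition topic above, the application needs at least 2 cores in total, for example via `spark-submit --executor-cores 2` or, for a local run, a sketch like this (the 2-core figure is just matched to the 2 partitions and is an assumption on my part):

```scala
// Each Kafka partition is pinned to one long-running task, and each task
// occupies a core for the lifetime of the query, so the total core count
// must be >= the number of partitions. With local[1], one partition is
// never read; local[2] covers both.
val spark = SparkSession.builder()
  .appName("continuous-copy")
  .master("local[2]")
  .getOrCreate()
```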

