Spark streaming applications subscribing to the same Kafka topic
I am new to Spark and Kafka, and I have a slightly different usage pattern of Spark Streaming with Kafka. I am using:
spark-core_2.10 - 2.1.1
spark-streaming_2.10 - 2.1.1
spark-streaming-kafka-0-10_2.10 - 2.0.0
kafka_2.10 - 0.10.1.1
Continuous event data is being streamed to a Kafka topic, which I need to process from multiple Spark Streaming applications. But when I run the Spark Streaming apps, only one of them receives the data.
Map<String, Object> kafkaParams = new HashMap<String, Object>();
kafkaParams.put("bootstrap.servers", "localhost:9092");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("auto.offset.reset", "latest");
kafkaParams.put("group.id", "test-consumer-group");
kafkaParams.put("enable.auto.commit", "true");
kafkaParams.put("auto.commit.interval.ms", "1000");
kafkaParams.put("session.timeout.ms", "30000");
Collection<String> topics = Arrays.asList("4908100105999_000005");
JavaInputDStream<ConsumerRecord<String, String>> stream =
    org.apache.spark.streaming.kafka010.KafkaUtils.createDirectStream(
        ssc,
        LocationStrategies.PreferConsistent(),
        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams));
... //spark processing
I have two Spark Streaming applications; usually the first one I submit consumes the Kafka messages, while the second application just waits for messages and never proceeds. As I have read, a Kafka topic can be subscribed to by multiple consumers. Is that not true for Spark Streaming? Or is there something I am missing about the Kafka topic and its configuration?
Thanks in advance.
You can create different streams with the same group ids. Here are more details from the online documentation for the 0.8 integration; there are two approaches:
Approach 1: Receiver-based Approach
Multiple Kafka input DStreams can be created with different groups and topics for parallel receiving of data using multiple receivers.
Approach 2: Direct Approach (No Receivers)
No need to create multiple input Kafka streams and union them. With directStream, Spark Streaming will create as many RDD partitions as there are Kafka partitions to consume, and all of them will read data from Kafka in parallel. So there is a one-to-one mapping between Kafka partitions and RDD partitions, which is easier to understand and tune.
You can read more in the Spark Streaming + Kafka Integration Guide 0.8.
From your code it looks like you are using 0.10, so refer to the Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0).
Even though it uses the Spark Streaming API, everything is controlled by Kafka consumer properties, so it depends on the group id you specify in the properties: you can start multiple streams with different group ids.
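As a minimal sketch of that idea (the helper method and the group-id strings below are hypothetical, not from your code): build the same consumer properties for each application but pass a distinct `group.id` per application, so that each application forms its own consumer group and receives the full stream instead of splitting it with the other one.

```java
import java.util.HashMap;
import java.util.Map;

public class GroupIdExample {

    // Builds Kafka consumer properties for one Spark Streaming application.
    // Each application must use its OWN group.id; consumers that share a
    // group.id divide the topic's partitions among themselves, so two apps
    // in the same group will not both see every message.
    static Map<String, Object> kafkaParams(String groupId) {
        Map<String, Object> p = new HashMap<>();
        p.put("bootstrap.servers", "localhost:9092");
        p.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        p.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        p.put("auto.offset.reset", "latest");
        p.put("group.id", groupId); // distinct per application
        return p;
    }

    public static void main(String[] args) {
        // Hypothetical group ids: one per submitted Spark application.
        Map<String, Object> app1Params = kafkaParams("app-1-consumer-group");
        Map<String, Object> app2Params = kafkaParams("app-2-consumer-group");
        System.out.println(app1Params.get("group.id") + " / " + app2Params.get("group.id"));
    }
}
```

Each map would then be passed to `ConsumerStrategies.Subscribe(topics, kafkaParams)` in its own application, exactly as in your existing code.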
Cheers!
The number of consumers [within a consumer group] cannot exceed the number of partitions in the topic. If you want to consume the messages in parallel, you will need to introduce a suitable number of partitions and create receivers to process each partition.
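As an illustration (assuming the standard CLI tools shipped with a Kafka 0.10 broker, and using a hypothetical topic name), a topic can be created with, or altered to, a partition count matching the desired parallelism:

```
# Create a topic with 4 partitions, so up to 4 consumers in one group
# can read it in parallel (Kafka 0.10 tools use --zookeeper).
bin/kafka-topics.sh --create --zookeeper localhost:2181 \
  --replication-factor 1 --partitions 4 --topic events

# Or increase partitions on an existing topic (cannot be decreased).
bin/kafka-topics.sh --alter --zookeeper localhost:2181 \
  --partitions 4 --topic events
```

With the 0.10 direct stream, Spark then creates one RDD partition per Kafka partition, so the partition count directly bounds the consuming parallelism within a group.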