Stream data using Spark from a particular partition within Kafka topics
I have already seen a similar question (click here), but I still want to know whether streaming data from a particular partition is really not possible. I have used Kafka ConsumerStrategies in Spark Streaming's Subscribe method.
ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets)
This is the code snippet I tried for subscribing to a topic and partition:
val topics = Array("cdc-classic")
val topic = "cdc-classic"
val partition = 2
val offsets = Map(new TopicPartition(topic, partition) -> 2L) // I am not clear about this line (I tried to set the partition number to 2)
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams, offsets))
But when I run this code, I get the following exception:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {cdc-classic-2=2}
at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:878)
at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:525)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1110)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
Caused by: org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {cdc-classic-2=2}
at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:878)
at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:525)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1110)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
PS: cdc-classic is the topic name, with 17 partitions.
Kafka's partition is Spark's parallelization unit. So even if it were technically possible somehow, it doesn't make sense, since all the data would be processed by a single executor. Instead of using Spark for this, you can simply run your process as a KafkaConsumer:
// consumer is an already-configured KafkaConsumer<String, String>
String topic = "foo";
TopicPartition partition0 = new TopicPartition(topic, 0);
TopicPartition partition1 = new TopicPartition(topic, 1);
consumer.assign(Arrays.asList(partition0, partition1));
( https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html )
If you want to benefit from Spark's automatic retries, you can simply build a Docker image with that consumer and launch it, for instance, on Kubernetes with an appropriate retry configuration.
Regarding Spark, if you really want to use it, you should check the current offsets of the partition you are reading. You have probably provided an incorrect one, and Kafka returns an "out of range" offset error (maybe start with 0?).
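The exception message also says "no configured reset policy", so another way to avoid it is to set auto.offset.reset in the consumer parameters, which makes Kafka fall back to the earliest (or latest) available offset instead of failing. A minimal sketch of such kafkaParams, assuming the broker address and group id below are placeholders:

import org.apache.kafka.common.serialization.StringDeserializer

// Hypothetical consumer parameters; bootstrap.servers and group.id are placeholders.
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "cdc-classic-reader",
  // Without a reset policy, a requested offset outside the partition's
  // available range raises OffsetOutOfRangeException.
  "auto.offset.reset" -> "earliest"
)

Note that "earliest" silently rewinds to the oldest retained offset, so use it only if reprocessing old records is acceptable.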
Specify the partition number and the starting offset of the partition to stream data from in this line:
Map(new TopicPartition(topic, partition) -> 2L)
where,
partition is the partition number
2L refers to the starting offset of the partition.
Then we can stream the data from the selected partition.
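Note that Subscribe still subscribes to the whole topic; to consume only that one partition, the ConsumerStrategies.Assign strategy can be used instead. A sketch under the question's setup (ssc and kafkaParams as defined earlier; the starting offset 0L is an assumption, and it must be an offset that actually exists in the partition):

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

// Read only partition 2 of cdc-classic, starting from offset 0L (assumed).
val partitionToRead = new TopicPartition("cdc-classic", 2)
val fromOffsets = Map(partitionToRead -> 0L)

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  // Assign pins the stream to exactly these partitions,
  // instead of subscribing to every partition of the topic.
  Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets))

With Assign, the resulting DStream has one Spark partition per assigned Kafka partition, so here all records flow through a single task, as the first answer points out.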