
Stream data using Spark from a particular partition within Kafka topics

I have already seen a similar question: click here.

But I still want to know: is streaming data from a particular partition really not possible? I have used Kafka ConsumerStrategies with the Subscribe method in Spark Streaming:

ConsumerStrategies.Subscribe[String, String](topics, kafkaParams, offsets)

This is the code snippet I tried out for subscribing to a topic and partition:

val topics = Array("cdc-classic")
val topic="cdc-classic"
val partition=2;
val offsets = Map(new TopicPartition(topic, partition) -> 2L) // I am not clear with this line (I tried to set topic and partition number as 2)
val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams,offsets))

But when I run this code I get the following exception:

     Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 0.0 failed 1 times, most recent failure: Lost task 5.0 in stage 0.0 (TID 5, localhost, executor driver): org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {cdc-classic-2=2}
    at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:878)
    at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:525)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1110)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
Caused by: org.apache.kafka.clients.consumer.OffsetOutOfRangeException: Offsets out of range with no configured reset policy for partitions: {cdc-classic-2=2}
    at org.apache.kafka.clients.consumer.internals.Fetcher.parseCompletedFetch(Fetcher.java:878)
    at org.apache.kafka.clients.consumer.internals.Fetcher.fetchedRecords(Fetcher.java:525)
    at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1110)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)

PS: cdc-classic is the topic name and it has 17 partitions.

Kafka's partition is Spark's parallelization unit. So even if it were technically possible somehow, it doesn't make sense, since all the data would be processed by a single executor. Instead of using Spark for that, you can simply run your process as a plain KafkaConsumer:

 // manually assign the specific partitions to read, instead of subscribing to the whole topic
 String topic = "foo";
 TopicPartition partition0 = new TopicPartition(topic, 0);
 TopicPartition partition1 = new TopicPartition(topic, 1);
 consumer.assign(Arrays.asList(partition0, partition1));

( https://kafka.apache.org/0110/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html )
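For completeness, here is a minimal, self-contained sketch of that approach in Scala, assuming a broker at localhost:9092 and the cdc-classic topic from the question; the group id and the auto.offset.reset choice are illustrative assumptions, not part of the original snippet:

import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import scala.collection.JavaConverters._

object SinglePartitionReader {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092")        // assumed broker address
    props.put("group.id", "cdc-classic-partition-2-reader") // hypothetical group id
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("auto.offset.reset", "earliest")               // avoids OffsetOutOfRangeException for a stale start offset

    val consumer = new KafkaConsumer[String, String](props)
    // assign() pins the consumer to exactly this partition; no consumer-group rebalancing is involved
    consumer.assign(Arrays.asList(new TopicPartition("cdc-classic", 2)))

    while (true) {
      val records = consumer.poll(500L)                      // poll(long) matches the 0.11 client linked above
      records.asScala.foreach { r =>
        println(s"partition=${r.partition()} offset=${r.offset()} value=${r.value()}")
      }
    }
  }
}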

If you want to benefit from Spark's automatic retries, you can simply build a Docker image with that consumer and launch it, for instance, on Kubernetes with an appropriate retry configuration.

Regarding Spark, if you really want to use it, you should check what the offsets of the partition you read actually are. You are probably providing an incorrect one, and that is why it returns the "out of range" offset message (maybe start with 0?).
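A quick way to check is to ask the broker for the earliest and latest offsets of that partition. A minimal sketch, assuming a broker at localhost:9092 and a kafka-clients version that provides beginningOffsets/endOffsets (0.10.1 or later):

import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092") // assumed broker address
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val probe = new KafkaConsumer[String, String](props)
val tp = new TopicPartition("cdc-classic", 2)

// beginningOffsets/endOffsets report the earliest and latest offsets the broker still holds;
// requesting anything outside this range raises OffsetOutOfRangeException unless
// auto.offset.reset is configured.
val earliest = probe.beginningOffsets(Arrays.asList(tp)).get(tp)
val latest = probe.endOffsets(Arrays.asList(tp)).get(tp)
println(s"valid offset range for $tp: [$earliest, $latest)")
probe.close()

The {cdc-classic-2=2} in the exception means the requested starting offset 2 is outside that range for partition 2, either because the log has been truncated past it or because the partition does not yet contain that many records.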

Specify the partition number and the starting offset of the partition to stream data from in this line:

Map(new TopicPartition(topic, partition) -> 2L)

where:

  • partition is the partition number

  • 2L refers to the starting offset of that partition.

Then we can stream the data from the selected partitions.
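Putting this together, here is a minimal sketch of streaming only that partition; ssc and kafkaParams are assumed to be defined as in the question, and ConsumerStrategies.Assign is used because it restricts consumption to exactly the listed partitions (Subscribe with an offsets map still reads the whole topic and only uses the map for starting positions):

import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Assign

// ssc and kafkaParams are assumed to be defined exactly as in the question
val tp = new TopicPartition("cdc-classic", 2)

// Pick a starting offset that actually exists in the partition (see the offset probe above);
// 0L here is only a placeholder.
val startingOffsets = Map(tp -> 0L)

// Assign consumes only the listed partitions, whereas Subscribe attaches to every partition
// of the topic and uses the offsets map merely as starting positions.
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Assign[String, String](List(tp), kafkaParams, startingOffsets))

stream.foreachRDD { rdd =>
  rdd.map(_.value).take(5).foreach(println) // simple sanity check per micro-batch
}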


