
Get number of partitions for a topic before creating a direct stream using kafka and spark streaming?

I have the following code that creates a direct stream using the Kafka connector for Spark:

public abstract class MessageConsumer<T> 
{
    public JavaInputDStream<ConsumerRecord<String, T>> createConsumer(final JavaStreamingContext jsc, 
        final Collection<String> topics, final String servers)
    {
        return KafkaUtils.createDirectStream(
            jsc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, T>Subscribe(topics,
                ConsumerUtils.getKafkaParams(servers, getGroupId(), getDeserializerClassName())));
    }

    protected abstract String getDeserializerClassName();

    protected abstract String getGroupId();
}

This works fine, but now I want to change the logic so that the consumer consumes from a specific partition of a topic, rather than letting Kafka decide which partition to consume from. I do this by using the same algorithm that Kafka's DefaultPartitioner uses to determine which partition a message is sent to based on its key: Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions. I then simply assign my consumer to this partition. For this to work, I need to know the total number of partitions available for the topic. However, I do not know how to get this information using the Kafka/Spark streaming API.
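For reference, outside Spark that assignment step can be done directly on a plain consumer. This is only a sketch; props, "my-topic", and partition stand in for real configuration:

import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Pin the consumer to one specific partition instead of subscribing to the
// whole topic and letting the group coordinator distribute partitions.
final KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
consumer.assign(Collections.singletonList(new TopicPartition("my-topic", partition)));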

I have been able to get this working in the parts of my application that don't use Spark, but I am unclear how to achieve it when using Spark. The only way I can see is to create another consumer before creating the direct stream, use it to get the total number of partitions, and then close it. See the code below for this implementation:

public abstract class MessageConsumer<T> 
{
    public JavaInputDStream<ConsumerRecord<String, T>> createConsumer(final JavaStreamingContext jsc, 
        final String topic, final String servers, final String groundStation)
    {
        // Temporary consumer, used only to discover how many partitions the topic has.
        final Properties props = ConsumerUtils.getKafkaParams(servers, getGroupId(), getDeserializerClassName());
        final Consumer<String, T> tempConsumer = new KafkaConsumer<>(props);
        final int numPartitions = tempConsumer.partitionsFor(topic).size();

        // Hash the key the same way the producer side does, then pin the stream to that partition.
        final int partition = calculateKafkaPartition(groundStation.getBytes(), numPartitions);
        final TopicPartition topicPartition = new TopicPartition(topic, partition);
        tempConsumer.close();

        return KafkaUtils.createDirectStream(
            jsc,
            LocationStrategies.PreferConsistent(),
            ConsumerStrategies.<String, T>Assign(Collections.singletonList(topicPartition),
                ConsumerUtils.getKafkaParams(servers, getGroupId(), getDeserializerClassName())));
    }

    protected abstract String getDeserializerClassName();

    protected abstract String getGroupId();

    private static int calculateKafkaPartition(final byte[] keyBytes, final int numberOfPartitions)
    {
        return Utils.toPositive(Utils.murmur2(keyBytes)) % numberOfPartitions;
    }
}

This doesn't seem right to me at all; surely there is a better way to do this?

You'd use Kafka's AdminClient to describe the topic. There's no Spark API for such information.
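For example, a minimal sketch of that lookup with the kafka-clients AdminClient (available since Kafka 0.11; the class name, method name, and exception handling here are illustrative):

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public final class TopicMetadata
{
    // Ask the cluster for the topic's metadata and return its partition count.
    public static int numPartitions(final String servers, final String topic)
        throws ExecutionException, InterruptedException
    {
        final Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, servers);
        try (final AdminClient admin = AdminClient.create(props))
        {
            // describeTopics returns a future per topic; block until the metadata arrives.
            final TopicDescription description =
                admin.describeTopics(Collections.singletonList(topic)).values().get(topic).get();
            return description.partitions().size();
        }
    }
}

That result could replace the tempConsumer.partitionsFor(topic) lookup in createConsumer above, so no throwaway KafkaConsumer needs to be created and closed just to read metadata.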
