Get number of partitions for a topic before creating a direct stream using kafka and spark streaming?
I have the following code that creates a direct stream using the Kafka connector for Spark:
    public abstract class MessageConsumer<T>
    {
        public JavaInputDStream<ConsumerRecord<String, T>> createConsumer(final JavaStreamingContext jsc,
            final Collection<String> topics, final String servers)
        {
            return KafkaUtils.createDirectStream(
                jsc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, T>Subscribe(topics,
                    ConsumerUtils.getKafkaParams(servers, getGroupId(), getDeserializerClassName())));
        }

        protected abstract String getDeserializerClassName();

        protected abstract String getGroupId();
    }
This works fine, but now I want to change the logic so the consumer consumes from a specific partition of a topic, rather than letting Kafka decide which partition to consume from. I do this by using the same algorithm the default Kafka partitioner uses to determine which partition to send a message to, based on the key:

    Utils.toPositive(Utils.murmur2(keyBytes)) % numPartitions;

I then simply assign my consumer to this partition. For this to work, I need to know the total number of partitions available for the topic. However, I do not know how to get this information using the Kafka/Spark Streaming API.
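The arithmetic of that partition selection can be sketched without the Kafka dependency. Note this is only an illustration: `toPositive` below mirrors Kafka's bitmask implementation, but the hash is a stand-in (`String.hashCode` instead of murmur2 over the serialized key), so the partition numbers will not match what Kafka actually assigns:

```java
// Sketch of the default partitioner's partition-selection arithmetic.
// Assumption: String.hashCode stands in for Utils.murmur2(keyBytes),
// so results will NOT match Kafka's real partition assignment.
public class PartitionSketch {
    // Mirrors Kafka's Utils.toPositive: clears the sign bit instead of
    // calling Math.abs, which would overflow for Integer.MIN_VALUE.
    static int toPositive(int number) {
        return number & 0x7fffffff;
    }

    static int choosePartition(String key, int numPartitions) {
        int hash = key.hashCode(); // placeholder for Utils.murmur2(keyBytes)
        return toPositive(hash) % numPartitions;
    }

    public static void main(String[] args) {
        // A given key always maps to the same partition for a fixed count.
        System.out.println(choosePartition("station-1", 6));
        System.out.println(choosePartition("station-1", 6));
    }
}
```

The key point is that the mapping is deterministic: as long as producer and consumer agree on the hash function and the partition count, the consumer can compute which partition a key's messages land on.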
I have been able to get this to work in other parts of my application that don't use Spark, but I am unclear how to achieve it when using Spark. The only way I can see to do this is to create another consumer before creating the direct stream, use it to get the total number of partitions, and then close it. See the below code for this implementation:
    import java.util.Collections;
    import java.util.Properties;

    import org.apache.kafka.clients.consumer.Consumer;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.utils.Utils;
    import org.apache.spark.streaming.api.java.JavaInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka010.ConsumerStrategies;
    import org.apache.spark.streaming.kafka010.KafkaUtils;
    import org.apache.spark.streaming.kafka010.LocationStrategies;

    public abstract class MessageConsumer<T>
    {
        public JavaInputDStream<ConsumerRecord<String, T>> createConsumer(final JavaStreamingContext jsc,
            final String topic, final String servers, final String groundStation)
        {
            final Properties props = ConsumerUtils.getKafkaParams(servers, getGroupId(), getDeserializerClassName());

            // Temporary consumer used only to fetch the topic's partition count.
            final Consumer<String, T> tempConsumer = new KafkaConsumer<>(props);
            final int numPartitions = tempConsumer.partitionsFor(topic).size();
            final int partition = calculateKafkaPartition(groundStation.getBytes(), numPartitions);
            final TopicPartition topicPartition = new TopicPartition(topic, partition);
            tempConsumer.close();

            return KafkaUtils.createDirectStream(
                jsc,
                LocationStrategies.PreferConsistent(),
                ConsumerStrategies.<String, T>Assign(Collections.singletonList(topicPartition),
                    ConsumerUtils.getKafkaParams(servers, getGroupId(), getDeserializerClassName())));
        }

        protected abstract String getDeserializerClassName();

        protected abstract String getGroupId();

        private static int calculateKafkaPartition(final byte[] keyBytes, final int numberOfPartitions)
        {
            return Utils.toPositive(Utils.murmur2(keyBytes)) % numberOfPartitions;
        }
    }
This doesn't seem right to me at all; surely there is a better way to do this?
You'd use Kafka's AdminClient to describe the topic. There is no Spark API for such information.
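A minimal sketch of that approach, assuming the kafka-clients dependency is on the classpath and a broker is reachable; the bootstrap server and topic name below are placeholders:

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

public class PartitionCount {
    public static void main(String[] args) throws Exception {
        final Properties props = new Properties();
        // Placeholder address; point this at your own cluster.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // describeTopics returns a future per topic; get() blocks until
            // the metadata arrives (or throws if the topic does not exist).
            final TopicDescription description = admin
                .describeTopics(Collections.singleton("my-topic"))
                .values()
                .get("my-topic")
                .get();
            System.out.println("Partitions: " + description.partitions().size());
        }
    }
}
```

This avoids creating and closing a throwaway KafkaConsumer just to read metadata, and AdminClient is the intended API for cluster/topic introspection.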