Creating a JavaPairRDD using KafkaUtils.createRDD (Spark and Kafka)
I'm writing a batch job to replay events from Kafka, using Kafka 0.10.1.0 and Spark 1.6.
I'm trying to use the JavaPairRDD javaPairRDD = KafkaUtils.createRDD(...) call:
Properties configProperties = new Properties();
configProperties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.4.1.194:9092");
configProperties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
configProperties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
org.apache.kafka.clients.producer.Producer producer = new KafkaProducer(configProperties);
for (String topic : topicNames) {
    List<PartitionInfo> partitionInfos = producer.partitionsFor(topic);
    for (PartitionInfo partitionInfo : partitionInfos) {
        log.debug("partition leader id: {}", partitionInfo.leader().id());
        JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
        Map<String, String> kafkaParams = new HashMap();
        kafkaParams.put("metadata.broker.list", "10.4.1.194:9092");
        kafkaParams.put("zookeeper.connect", "10.4.1.194:2181");
        kafkaParams.put("group.id", "kafka-replay");
        OffsetRange[] offsetRanges = new OffsetRange[]{OffsetRange.create(topic, partitionInfo.partition(), 0, Long.MAX_VALUE)};
        JavaPairRDD<String, String> javaPairRDD = KafkaUtils.createRDD(
            sparkContext,
            String.class,
            String.class,
            StringDecoder.class,
            StringDecoder.class,
            kafkaParams,
            offsetRanges);
        javaPairRDD
            .map(t -> getInstrEvent(t._2))
            .filter(ie -> startTimestamp <= ie.getTimestamp() && ie.getTimestamp() <= endTimestamp)
            .foreach(s -> System.out.println(s));
    }
}
However, it fails with the error:
2016-12-14 15:45:44,700 [main] ERROR com.goldenrat.analytics.KafkaToHdfsReplayMain - error
org.apache.spark.SparkException: Offsets not available on leader: OffsetRange(topic: 'sfs_create_room', partition: 0, range: [1 -> 100])
at org.apache.spark.streaming.kafka.KafkaUtils$.org$apache$spark$streaming$kafka$KafkaUtils$$checkOffsets(KafkaUtils.scala:200)
at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$1.apply(KafkaUtils.scala:253)
at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$1.apply(KafkaUtils.scala:249)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
at org.apache.spark.streaming.kafka.KafkaUtils$.createRDD(KafkaUtils.scala:249)
at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$3.apply(KafkaUtils.scala:338)
at org.apache.spark.streaming.kafka.KafkaUtils$$anonfun$createRDD$3.apply(KafkaUtils.scala:333)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.SparkContext.withScope(SparkContext.scala:714)
at org.apache.spark.streaming.kafka.KafkaUtils$.createRDD(KafkaUtils.scala:333)
at org.apache.spark.streaming.kafka.KafkaUtils.createRDD(KafkaUtils.scala)
at com.goldenrat.analytics.KafkaToHdfsReplayMain$KafkaToHdfsReplayJob.start(KafkaToHdfsReplayMain.java:172)
I can use other clients to connect to the broker and fetch messages, so I know it's not the broker. Any help?
It looks like you cannot specify a non-existent offset in your range. I was hoping I could get all offsets by specifying 0 to Long.MAX_VALUE, but it fails with that error message if an offset is invalid. If I specify valid offsets (min/max) for the range, it does work. For anyone else who stumbles upon this, you can get them with something like:
Properties configProperties = new Properties();
configProperties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "10.4.1.194:9092");
configProperties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
configProperties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringSerializer");
// The producer is only used to look up the partitions of each topic.
org.apache.kafka.clients.producer.Producer producer = new KafkaProducer(configProperties);
for (String topic : topicNames) {
    // ... offsets.get(topic).getMinimum(), offsets.get(topic).getMaximum());
    log.debug("doing topic: {}", topic);
    List<PartitionInfo> partitionInfos = producer.partitionsFor(topic);
    for (PartitionInfo partitionInfo : partitionInfos) {
        TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partitionInfo.partition());
        Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo = new HashMap<>();
        SimpleConsumer consumer = new SimpleConsumer("10.4.1.194", 9092, 10000, 64 * 1024, "kafka-replay");

        // Ask the broker for the earliest offset still available in this partition.
        requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.EarliestTime(), 1));
        kafka.javaapi.OffsetRequest request = new kafka.javaapi.OffsetRequest(requestInfo, kafka.api.OffsetRequest.CurrentVersion(), "kafka-replay");
        OffsetResponse response = consumer.getOffsetsBefore(request);
        if (response.hasError()) {
            log.error("error, " + response.errorCode(topic, partitionInfo.partition()));
        }
        long[] earliestOffsetsArray = response.offsets(topic, partitionInfo.partition());

        // Repeat the request for the latest offset.
        requestInfo = new HashMap<>();
        requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(kafka.api.OffsetRequest.LatestTime(), 1));
        request = new kafka.javaapi.OffsetRequest(requestInfo, kafka.api.OffsetRequest.CurrentVersion(), "kafka-replay");
        response = consumer.getOffsetsBefore(request);
        if (response.hasError()) {
            log.error("error, " + response.errorCode(topic, partitionInfo.partition()));
        }
        long[] latestOffsetsArray = response.offsets(topic, partitionInfo.partition());
        ...
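Once you have the earliest and latest offsets, the remaining step is to build a range that stays inside them. Below is a minimal, dependency-free sketch of that step, not the code from my job: the class OffsetRangeClamp, its Range type, and the methods isAvailable and clamp are hypothetical names made up for illustration, and the offsets 40 and 100 are invented. It assumes, as far as I can tell from the error, that Spark rejects any range that does not lie entirely between the leader's earliest and latest offsets, and it treats the upper bound as exclusive, the way OffsetRange treats untilOffset.

```java
import java.util.Optional;

public class OffsetRangeClamp {

    /** Half-open offset range [from, until); hypothetical stand-in for Spark's OffsetRange. */
    public static final class Range {
        public final long from;
        public final long until;

        public Range(long from, long until) {
            this.from = from;
            this.until = until;
        }

        @Override
        public String toString() {
            return "[" + from + " -> " + until + "]";
        }
    }

    /** True if the requested range lies entirely within what the leader reports. */
    public static boolean isAvailable(Range requested, long earliest, long latest) {
        return earliest <= requested.from && requested.until <= latest;
    }

    /**
     * Shrinks the requested range to the available offsets.
     * Returns empty if nothing in the requested range is available.
     */
    public static Optional<Range> clamp(Range requested, long earliest, long latest) {
        long from = Math.max(requested.from, earliest);
        long until = Math.min(requested.until, latest);
        return from < until ? Optional.of(new Range(from, until)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Invented values standing in for earliestOffsetsArray[0] and latestOffsetsArray[0].
        long earliest = 40L;
        long latest = 100L;

        // The range from the question: everything from 0 to Long.MAX_VALUE.
        Range requested = new Range(0L, Long.MAX_VALUE);

        System.out.println("requested range available: " + isAvailable(requested, earliest, latest));
        clamp(requested, earliest, latest)
            .ifPresent(r -> System.out.println("clamped range: " + r));
    }
}
```

In the real job, the clamped from and until values would be passed to OffsetRange.create(topic, partitionInfo.partition(), from, until) in place of 0 and Long.MAX_VALUE.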