pySpark Kafka Direct Streaming: update Zookeeper / Kafka offset

Question

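Currently I am working with Kafka / Zookeeper and pySpark (1.6.0). I have successfully created a Kafka consumer using KafkaUtils.createDirectStream().

The streaming itself works without problems, but I noticed that my Kafka topics are not updated to the current offset after I have consumed some messages.

That is awkward, because we need the topics to be updated for our monitoring here.

In the Spark documentation I found this note: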

offsetRanges = []

def storeOffsetRanges(rdd):
    # Capture the offset ranges of the current batch on the driver.
    global offsetRanges
    offsetRanges = rdd.offsetRanges()
    return rdd

def printOffsetRanges(rdd):
    for o in offsetRanges:
        print("%s %s %s %s" % (o.topic, o.partition, o.fromOffset, o.untilOffset))

directKafkaStream\
    .transform(storeOffsetRanges)\
    .foreachRDD(printOffsetRanges)

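If you want Zookeeper-based Kafka monitoring tools to show progress of the streaming application, you can use this to update Zookeeper yourself.

Here is the documentation: http://spark.apache.org/docs/1.6.0/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers

I found a solution in Scala, but I could not find an equivalent for Python. Here is the Scala example: http://geeks.aretotally.in/spark-streaming-kafka-direct-api-store-offsets-in-zk/

So my question is: how do I update Zookeeper from there?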

Answer 1

I wrote some functions to save and read Kafka offsets with the Python kazoo library.

The first function returns a singleton instance of the Kazoo client:

ZOOKEEPER_SERVERS = "127.0.0.1:2181"

def get_zookeeper_instance():
    from kazoo.client import KazooClient

    if 'KazooSingletonInstance' not in globals():
        globals()['KazooSingletonInstance'] = KazooClient(ZOOKEEPER_SERVERS)
        globals()['KazooSingletonInstance'].start()
    return globals()['KazooSingletonInstance']

Then the functions to read and write offsets:

def read_offsets(zk, topics):
    from pyspark.streaming.kafka import TopicAndPartition

    from_offsets = {}
    for topic in topics:
        # Each child node is named after a partition id and holds its offset.
        for partition in zk.get_children(f'/consumers/{topic}'):
            topic_partition = TopicAndPartition(topic, int(partition))
            offset = int(zk.get(f'/consumers/{topic}/{partition}')[0])
            from_offsets[topic_partition] = offset
    return from_offsets

def save_offsets(rdd):
    zk = get_zookeeper_instance()
    for offset in rdd.offsetRanges():
        path = f"/consumers/{offset.topic}/{offset.partition}"
        zk.ensure_path(path)
        zk.set(path, str(offset.untilOffset).encode())
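One caveat with `read_offsets` above: on the very first run the `/consumers/{topic}` node does not exist yet, and kazoo's `get_children` raises `NoNodeError` for a missing path. Below is a minimal sketch of a guarded variant, assuming the same path layout as above. The name `read_offsets_safe` is hypothetical, and it returns plain `(topic, partition)` tuples so it can be tried without Spark.

```python
def read_offsets_safe(zk, topics):
    """Return {(topic, partition): offset} for every partition node found.

    `zk` is assumed to be a started kazoo KazooClient (or any object with
    the same exists / get_children / get methods). Topics without stored
    offsets are skipped instead of raising NoNodeError.
    """
    offsets = {}
    for topic in topics:
        base = "/consumers/{}".format(topic)
        if not zk.exists(base):
            # First run: nothing has been saved for this topic yet.
            continue
        for partition in zk.get_children(base):
            data, _stat = zk.get("{}/{}".format(base, partition))
            if not data:
                # Node was created but never written; ignore it.
                continue
            offsets[(topic, int(partition))] = int(data)
    return offsets
```

To pass the result to createDirectStream, the keys can be converted with `{TopicAndPartition(t, p): o for (t, p), o in offsets.items()}`.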

Then, before starting the stream, you can read the offsets from Zookeeper and pass them to createDirectStream through the fromOffsets argument:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils


def main(brokers="127.0.0.1:9092", topics=['test1', 'test2']):
    sc = SparkContext(appName="PythonStreamingSaveOffsets")
    ssc = StreamingContext(sc, 2)

    zk = get_zookeeper_instance()
    from_offsets = read_offsets(zk, topics)

    directKafkaStream = KafkaUtils.createDirectStream(
        ssc, topics, {"metadata.broker.list": brokers},
        fromOffsets=from_offsets)

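    # Persist the offsets of every processed batch to Zookeeper.
    directKafkaStream.foreachRDD(save_offsets)

    # Nothing runs until the streaming context is started.
    ssc.start()
    ssc.awaitTermination()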


if __name__ == "__main__":
    main()

Answer 2

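I ran into a similar problem. You are right: using the direct stream means using the Kafka low-level API directly, which does not update the consumer offsets. There are a few examples for Scala / Java, but none for Python. It is easy to do yourself, though. What you need to do is:

read from the stored offset at the beginning
save the offset at the end

For example, I save the offset of each partition in Redis by doing the following: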

def save_offset(rdd):
    ranges = rdd.offsetRanges()
    for rng in ranges:
        rng.untilOffset  # save offset somewhere

stream.foreachRDD(save_offset)

Then at the beginning you can use:

from pyspark.streaming.kafka import TopicAndPartition

fromoffset = {}
topic_partition = TopicAndPartition(topic, partition)
fromoffset[topic_partition] = int(value)  # value: the offset read back from where you stored it previously
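Since this answer stores offsets in Redis, here is a minimal sketch of both halves (save at the end of each batch, read at startup). It assumes a client object with redis-py style `set(key, value)` and `get(key)` methods. The key layout `kafka:offsets:<topic>:<partition>` and the function names are made up for illustration and are not part of the original answer.

```python
def offset_key(topic, partition):
    # Hypothetical key layout; any unique per-partition key works.
    return "kafka:offsets:{}:{}".format(topic, partition)


def save_offsets_to_store(client, rdd):
    """Store the end offset of every partition in the batch."""
    for rng in rdd.offsetRanges():
        client.set(offset_key(rng.topic, rng.partition), rng.untilOffset)


def load_offsets(client, topic, partitions):
    """Return {(topic, partition): offset} for partitions that have a stored value."""
    offsets = {}
    for partition in partitions:
        raw = client.get(offset_key(topic, partition))
        if raw is not None:
            # redis-py returns bytes by default; int() accepts them in Python 3.
            offsets[(topic, partition)] = int(raw)
    return offsets
```

The save half could be hooked in with `stream.foreachRDD(lambda rdd: save_offsets_to_store(client, rdd))`, and each `(topic, partition)` key from `load_offsets` can be turned into a TopicAndPartition before it goes into the fromOffsets dictionary.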

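For tools that track offsets through Zookeeper, it is better to save the offsets in Zookeeper. This page: https://community.hortonworks.com/articles/81357/manually-resetting-offset-for-a-kafka-topic.html describes how to set the offset. Basically, the zk node is /consumers/[consumer name]/offsets/[topic name]/[partition id]. Since we are using the direct stream, you have to make up a consumer name.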
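The node layout described above can be sketched as follows, assuming a started kazoo client is passed in as `zk`. The helper names are hypothetical, and the consumer name is whatever you choose to make up, since the direct stream has no consumer group of its own.

```python
def offset_path(consumer, topic, partition):
    # /consumers/[consumer name]/offsets/[topic name]/[partition id]
    return "/consumers/{}/offsets/{}/{}".format(consumer, topic, partition)


def save_consumer_offsets(zk, consumer, offset_ranges):
    """Write the end offset of each range under the made-up consumer name.

    `zk` is assumed to expose kazoo's ensure_path and set methods;
    `offset_ranges` is what rdd.offsetRanges() returns for a batch.
    """
    for rng in offset_ranges:
        path = offset_path(consumer, rng.topic, rng.partition)
        zk.ensure_path(path)                        # create missing parent nodes
        zk.set(path, str(rng.untilOffset).encode())  # kazoo expects bytes
```

Inside foreachRDD this would be called as `save_consumer_offsets(zk, "my-spark-app", rdd.offsetRanges())`, where `"my-spark-app"` is a placeholder consumer name.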
