
Spark Python Avro Kafka Deserialiser

I have created a Kafka stream in a Python Spark app and can parse any text that comes through it.

            kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1})

I want to change this to be able to parse Avro messages from a Kafka topic. When parsing Avro messages from a file, I do it like:

            reader = DataFileReader(open("customer.avro", "rb"), DatumReader())
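
(For reference, a minimal runnable version of that file-based read, using the standard avro package; note that Avro container files are binary, so the file has to be opened with "rb":)

            from avro.datafile import DataFileReader
            from avro.io import DatumReader

            # Avro container files embed their writer schema, so no schema argument is needed
            reader = DataFileReader(open("customer.avro", "rb"), DatumReader())
            for record in reader:
                print(record)
            reader.close()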

I'm new to Python and Spark; how do I change the stream to be able to parse the Avro messages? Also, how can I specify a schema to use when reading the Avro message from Kafka? I've done all this in Java before, but Python is confusing me.

Edit:

I tried changing it to include the Avro decoder:

            kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {topic: 1},valueDecoder=avro.io.DatumReader(schema))

but I get the following error:

            TypeError: 'DatumReader' object is not callable
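
(The valueDecoder argument expects a plain function that takes the raw message bytes, while DatumReader is a reader object, hence the TypeError. A minimal sketch of a decoder function built on avro's DatumReader and BinaryDecoder, assuming schema is an already-parsed avro schema object:)

            import io
            import avro.io

            # 'schema' is an avro schema object, parsed elsewhere (e.g. with avro.schema.parse)
            reader = avro.io.DatumReader(schema)

            def avro_decoder(raw_bytes):
                # wrap the raw Kafka payload and read one datum against the schema
                decoder = avro.io.BinaryDecoder(io.BytesIO(raw_bytes))
                return reader.read(decoder)

            kafkaStream = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer",
                                                  {topic: 1}, valueDecoder=avro_decoder)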

I had the same challenge - deserializing Avro messages from Kafka in pyspark - and solved it with the Confluent Schema Registry module's MessageSerializer, as in our case the schema is stored in a Confluent Schema Registry.

You can find that module at https://github.com/verisign/python-confluent-schemaregistry

from confluent.schemaregistry.client import CachedSchemaRegistryClient
from confluent.schemaregistry.serializers import MessageSerializer
schema_registry_client = CachedSchemaRegistryClient(url='http://xx.xxx.xxx:8081')
serializer = MessageSerializer(schema_registry_client)


# simple decoder to replace Kafka streaming's built-in UTF-8 decoding
def decoder(s):
    decoded_message = serializer.decode_message(s)
    return decoded_message

kvs = KafkaUtils.createDirectStream(ssc, ["mytopic"], {"metadata.broker.list": "xxxxx:9092,yyyyy:9092"}, valueDecoder=decoder)

lines = kvs.map(lambda x: x[1])
lines.pprint()
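
(As with any Spark Streaming job, the context then still has to be started for the stream to run:)

ssc.start()
ssc.awaitTermination()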

As you can see, this code uses the new, direct approach with no receivers, hence the createDirectStream (see more at https://spark.apache.org/docs/1.5.1/streaming-kafka-integration.html ).

As mentioned by @Zoltan Fedor in the comment, the provided answer is a bit old now, as 2.5 years have passed since it was written. The confluent-kafka-python library has evolved to support the same functionality natively. The only change needed in the given code is the following.

from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer

And then, you can change this line:

kvs = KafkaUtils.createDirectStream(ssc, ["mytopic"], {"metadata.broker.list": "xxxxx:9092,yyyyy:9092"}, valueDecoder=serializer.decode_message)
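
(Putting it together, a minimal sketch of the updated setup, reusing the placeholder registry URL and broker list from the earlier answer; decode_message is a bound method, so it can be passed directly as the valueDecoder:)

from confluent_kafka.avro.cached_schema_registry_client import CachedSchemaRegistryClient
from confluent_kafka.avro.serializer.message_serializer import MessageSerializer

schema_registry_client = CachedSchemaRegistryClient(url='http://xx.xxx.xxx:8081')
serializer = MessageSerializer(schema_registry_client)

kvs = KafkaUtils.createDirectStream(ssc, ["mytopic"],
                                    {"metadata.broker.list": "xxxxx:9092,yyyyy:9092"},
                                    valueDecoder=serializer.decode_message)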

I have tested it and it works nicely. I am adding this answer for anyone who may need it in the future.

If you don't want to use Confluent Schema Registry and have a schema in a text file or dict object, you can use the fastavro Python package to decode the Avro messages of your Kafka stream:

from pyspark.sql import SparkSession
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext
import io
import fastavro

def decoder(msg):
    # your message schema goes here
    schema = {
      "namespace": "...",
      "type": "...",
      "name": "...",
      "fields": [
        {
          "name": "...",
          "type": "..."
        }
      ]
    }
    bytes_io = io.BytesIO(msg)
    bytes_io.seek(0)
    msg_decoded = fastavro.schemaless_reader(bytes_io, schema)
    return msg_decoded

session = SparkSession.builder \
                      .appName("Kafka Spark Streaming Avro example") \
                      .getOrCreate()

streaming_context = StreamingContext(sparkContext=session.sparkContext,
                                     batchDuration=5)

kafka_stream = KafkaUtils.createDirectStream(ssc=streaming_context,
                                             topics=['your_topic_1', 'your_topic_2'],
                                             kafkaParams={"metadata.broker.list": "your_kafka_broker_1,your_kafka_broker_2"},
                                             valueDecoder=decoder)
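
(The values arriving in the stream are then already-decoded dicts; a minimal sketch of consuming them and starting the context:)

lines = kafka_stream.map(lambda x: x[1])
lines.pprint()

streaming_context.start()
streaming_context.awaitTermination()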
