Kafka Connect: Read JSON serialized Kafka message, convert to Parquet format and persist in S3

I have a requirement to read JSON serialized messages from a Kafka topic, convert them to Parquet and persist in S3.

Background

The official S3-Sink-Connector supports Parquet output format, but:

You must use the AvroConverter, ProtobufConverter, or JsonSchemaConverter with ParquetFormat for this connector. Attempting to use the JsonConverter (with or without schemas) results in a NullPointerException and a StackOverflowException.

And JsonSchemaConverter throws an error if the message was not written using JSON Schema serialization.
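For context, a minimal S3 sink configuration targeting Parquet output might look like the sketch below. The connector, storage, format, and converter class names are from the Confluent S3 sink documentation; the topic, bucket, region, and Schema Registry URL are placeholders:

```json
{
  "name": "s3-parquet-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-topic",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "value.converter": "io.confluent.connect.json.JsonSchemaConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "flush.size": "1000"
  }
}
```

This is exactly the combination the quoted restriction describes: ParquetFormat only works when the converter (here JsonSchemaConverter) can hand it schema'd records, which is where plain-JSON topics fall down.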

Problem Statement

So, I'm looking for a way to read messages from a Kafka topic that were originally written in JSON format, somehow convert them to JSON Schema format and then plug them into the S3 connector that will write to S3 in Parquet format.

Alternatively, I'm also open to other solutions (that don't involve writing Java code) given the main requirement (take a Kafka message, put it in S3 as Parquet files). Thanks!

PS: Changing the way these Kafka messages are originally written (such as using JSON Schema serialization with Schema Discovery) unfortunately is not an option for me at this time.

Answer

In general, your data is required to have a schema, because Parquet needs one (the S3 Parquet writer translates to Avro as an intermediate step).
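To make the schema requirement concrete: a plain JSON value such as {"id": 42, "name": "example"} carries no schema at all. Kafka Connect's built-in JsonConverter (with schemas.enable=true) expects each message to embed one in a schema/payload envelope like the sketch below, and per the restriction quoted earlier, even that combination cannot be used with ParquetFormat:

```json
{
  "schema": {
    "type": "struct",
    "name": "record",
    "optional": false,
    "fields": [
      { "field": "id", "type": "int64" },
      { "field": "name", "type": "string" }
    ]
  },
  "payload": { "id": 42, "name": "example" }
}
```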

You could look into using this Connect transform, which takes in a schema and attempts to apply a JSON Schema (see its tests). Since this returns a Struct object, you can then try to use JsonSchemaConverter as part of the sink.
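As an illustration only: if the transform in question is the FromJson SMT from the jcustenborder/kafka-connect-json-schema project (an assumption on my part; the answer's link text doesn't name it), the sink-side wiring might look roughly like this. Property names follow that project's README and may differ by version; the inline JSON Schema is a placeholder:

```json
{
  "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
  "transforms": "fromJson",
  "transforms.fromJson.type": "com.github.jcustenborder.kafka.connect.json.FromJson$Value",
  "transforms.fromJson.json.schema.location": "Inline",
  "transforms.fromJson.json.schema.inline": "{\"type\":\"object\",\"properties\":{\"id\":{\"type\":\"integer\"},\"name\":{\"type\":\"string\"}}}"
}
```

The idea is that a pass-through converter hands the raw JSON bytes to the SMT, which parses them against the declared schema and emits a Struct that the sink's format writer can serialize.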

But if you are just throwing random JSON data into a single topic without any consistent fields or values, then you'll have a hard time applying any schema.
