Kafka Connect S3 sink throws IllegalArgumentException when loading Avro

I'm using qubole's S3 sink to load Avro data into S3 in Parquet format.

In my Java application I create a producer:

// Producer configured to send both keys and values as raw bytes
Properties props = new Properties();
props.put("bootstrap.servers", KafkaHelper.getServers());
props.put("key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer");
return new KafkaProducer<byte[], byte[]>(props);

Then convert a GenericRecord into byte[] format:

// Bijection (GenericAvroCodecs) serializes the GenericRecord to raw Avro binary
GenericRecord avroRecord = new GenericData.Record(avroSchema);
Injection<GenericRecord, byte[]> recordInjection = GenericAvroCodecs.toBinary(avroSchema);

for (Map.Entry<String, ?> entry : map.entrySet()) {
    String key = entry.getKey();
    Object value = entry.getValue();
    avroRecord.put(key, value);
}

// The record value is the serialized Avro byte[]; no key is set
ProducerRecord<byte[], byte[]> record = new ProducerRecord<>(topic, recordInjection.apply(avroRecord));
producer.send(record);

I use the following values in my Kafka Connect properties:

key.converter=com.qubole.streamx.ByteArrayConverter
value.converter=com.qubole.streamx.ByteArrayConverter
internal.key.converter=org.apache.kafka.connect.json.JsonConverter
internal.value.converter=org.apache.kafka.connect.json.JsonConverter
internal.key.converter.schemas.enable=false
internal.value.converter.schemas.enable=false

And the following configuration options in my file sink properties:

connector.class=com.qubole.streamx.s3.S3SinkConnector
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat

When I run the connector I get the following error message: 'java.lang.IllegalArgumentException: Avro schema must be a record'.

I'm pretty new to Kafka Connect and I know that a Schema Registry server can be set up, but I don't understand whether the sink needs the registry to convert the Avro data to Parquet, or whether this is some kind of formatting or configuration problem on my end. What kind of data format does a "record" refer to in the context of this error? Any direction or help would be much appreciated.

The ByteArrayConverter is not going to do any translation of data: instead of actually doing any serialization/deserialization, it assumes the connector knows how to handle raw byte[] data. However, the ParquetFormat (and in fact most formats) cannot handle just raw data. Instead, they expect data to be deserialized and structured as a record (which you can think of as a C struct, a POJO, etc.).
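
If the goal is to keep the data as Avro end to end and have the sink write Parquet, one common setup (a sketch, not taken from this answer, assuming a Confluent Schema Registry is reachable at the example URL below) is to let Connect deserialize the messages into structured records with the AvroConverter, which ParquetFormat can then write:

key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081

On the producer side this would mean swapping the ByteArraySerializer/Bijection combination for io.confluent.kafka.serializers.KafkaAvroSerializer, so that the bytes on the topic carry the Schema Registry framing the converter expects.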

Note that the qubole streamx README points out that ByteArrayConverter is useful in cases where you can safely copy the data directly. Examples would be if you have the data as JSON or CSV. These don't need deserialization because the bytes for each Kafka record's value can simply be copied into the output file. This is a nice optimization in those cases, but not generally applicable to all output file formats.
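
By contrast, a minimal sketch of that byte-copy case might look like the following, assuming the topic already contains JSON text; the format class name here is a hypothetical placeholder for whatever raw/pass-through format the streamx connector actually ships:

key.converter=com.qubole.streamx.ByteArrayConverter
value.converter=com.qubole.streamx.ByteArrayConverter
# hypothetical pass-through format that appends the raw bytes; check the streamx README for the real class name
format.class=com.example.RawByteFormat

In that setup nothing needs to understand the payload, which is why ByteArrayConverter is sufficient; ParquetFormat, by comparison, must build a schema and a columnar layout, so it needs the structured records described above.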
