
Kafka connect -> S3 Parquet file byteArray

I'm trying to use Kafka Connect to consume Kafka messages and write them to an S3 Parquet file, so I wrote a simple producer that produces messages with a byte[] value:

import java.util.Properties;
import java.util.Random;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Headers;
import org.apache.kafka.common.header.internals.RecordHeader;
import org.apache.kafka.common.header.internals.RecordHeaders;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.LongSerializer;

Properties propertiesAWS = new Properties();
propertiesAWS.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "myKafka:9092");
propertiesAWS.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class.getName());
propertiesAWS.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

KafkaProducer<Long, byte[]> producer = new KafkaProducer<>(propertiesAWS);
Random rng = new Random();

for (int i = 0; i < 100; i++) {
    try {
        Thread.sleep(1000);
        // Attach two string-valued headers to each record
        Headers headers = new RecordHeaders();
        headers.add(new RecordHeader("header1", "header1".getBytes()));
        headers.add(new RecordHeader("header2", "header2".getBytes()));
        // topic, partition, key, value, headers
        ProducerRecord<Long, byte[]> recordOut = new ProducerRecord<>(
                "s3.test.topic", 1, rng.nextLong(), new byte[]{1, 2, 3}, headers);
        producer.send(recordOut);
    } catch (Exception e) {
        System.out.println(e);
    }
}

producer.flush();
producer.close();

and my Kafka Connect configuration is:

{
    "name": "test_2_s3",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "aws.access.key.id": "XXXXXXX",
        "aws.secret.access.key": "XXXXXXXX",
        "s3.region": "eu-central-1",
        "flush.size": "5",
        "rotate.schedule.interval.ms": "10000",
        "timezone": "UTC",
        "tasks.max": "1",
        "topics": "s3.test.topic",
        "parquet.codec": "gzip",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "s3.bucket.name": "test-phase1",
        "key.converter": "org.apache.kafka.connect.converters.LongConverter",
        "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
        "behavior.on.null.values": "ignore",
        "store.kafka.headers": "true"
    }
}

and this is the error I got:

Caused by: java.lang.IllegalArgumentException: Avro schema must be a record.
    at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:124)

Where is my mistake? Do I need to use Avro even though I just want to write the byte[] plus some Kafka headers? How do I configure which Kafka headers are written to Parquet? Thanks

The ParquetFormat writer requires Avro data with a schema of type: record, yes.

You could use a schema like this, though:

{
    "namespace": "com.stackoverflow.example",
    "name": "ByteWrapper",
    "type": "record",
    "fields": [{
        "name": "data",
        "type": "bytes"
    }]
}
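For illustration, a minimal producer sketch using that wrapper schema might look like the following. This is not from the original post: the class name, the Schema Registry address, and the hard-coded payload are assumptions.

import java.nio.ByteBuffer;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.LongSerializer;

public class AvroBytesProducer {

    // Same record schema as above, inlined as a string
    private static final String BYTE_WRAPPER_SCHEMA =
            "{\"namespace\":\"com.stackoverflow.example\",\"name\":\"ByteWrapper\","
          + "\"type\":\"record\",\"fields\":[{\"name\":\"data\",\"type\":\"bytes\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "myKafka:9092");
        props.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class.getName());
        // Confluent's Avro serializer registers the schema and embeds its ID in every message
        props.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        // Assumed Schema Registry location; adjust to your environment
        props.setProperty("schema.registry.url", "http://my-schema-registry:8081");

        Schema schema = new Schema.Parser().parse(BYTE_WRAPPER_SCHEMA);

        try (KafkaProducer<Long, GenericRecord> producer = new KafkaProducer<>(props)) {
            GenericRecord wrapper = new GenericData.Record(schema);
            // Avro "bytes" fields are carried as ByteBuffer in a GenericRecord
            wrapper.put("data", ByteBuffer.wrap(new byte[]{1, 2, 3}));
            producer.send(new ProducerRecord<>("s3.test.topic", 1L, wrapper));
        }
    }
}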

"configure which kafka header to write to parquet"

None. Kafka headers are just bytes, and have no designated serialization format.

To use ParquetFormat, you'll need to change your producer to use KafkaAvroSerializer from Confluent and also use their AvroConverter in Connect to deserialize that data prior to writing Parquet.
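On the Connect side, the matching change is to the value converter. Only the affected keys are sketched below; the registry URL is again an assumption:

"value.converter": "io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url": "http://my-schema-registry:8081"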


If you want to write only bytes to S3, use ByteArrayFormat instead.
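If you go that route, roughly the only change to the sink configuration above is the format class (and parquet.codec no longer applies):

"format.class": "io.confluent.connect.s3.format.bytearray.ByteArrayFormat"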
