
Kafka Connect -> S3 Parquet file from byte array

I'm trying to use Kafka Connect to consume Kafka messages and write them to S3 as Parquet files, so I wrote a simple producer which produces messages with a byte[] value:

import java.util.Properties;
import java.util.Random;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Headers;
import org.apache.kafka.common.header.internals.RecordHeader;
import org.apache.kafka.common.header.internals.RecordHeaders;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.LongSerializer;

Properties propertiesAWS = new Properties();
propertiesAWS.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "myKafka:9092");
propertiesAWS.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class.getName());
propertiesAWS.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

KafkaProducer<Long, byte[]> producer = new KafkaProducer<>(propertiesAWS);
Random rng = new Random();

for (int i = 0; i < 100; i++) {
    try {
        Thread.sleep(1000);
        Headers headers = new RecordHeaders();
        headers.add(new RecordHeader("header1", "header1".getBytes()));
        headers.add(new RecordHeader("header2", "header2".getBytes()));
        // topic, partition, key, value, headers
        ProducerRecord<Long, byte[]> recordOut = new ProducerRecord<>(
                "s3.test.topic", 1, rng.nextLong(), new byte[]{1, 2, 3}, headers);
        producer.send(recordOut);
    } catch (Exception e) {
        System.out.println(e);
    }
}

producer.flush();
producer.close();

And my Kafka Connect configuration is:

{
    "name": "test_2_s3",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "aws.access.key.id": "XXXXXXX",
        "aws.secret.access.key": "XXXXXXXX",
        "s3.region": "eu-central-1",
        "flush.size": "5",
        "rotate.schedule.interval.ms": "10000",
        "timezone": "UTC",
        "tasks.max": "1",
        "topics": "s3.test.topic",
        "parquet.codec": "gzip",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "s3.bucket.name": "test-phase1",
        "key.converter": "org.apache.kafka.connect.converters.LongConverter",
        "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
        "behavior.on.null.values": "ignore",
        "store.kafka.headers": "true"
    }
}

And this is the error I got:

Caused by: java.lang.IllegalArgumentException: Avro schema must be a record. at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:124)

Where is my mistake? Do I need to use Avro even though I just want to write the byte array plus some Kafka headers? How do I configure which Kafka headers get written to Parquet? Thanks.

Yes, the ParquetFormat writer requires Avro data with a schema of type record.

You could use a schema like this, though.

{
    "namespace": "com.stackoverflow.example",
    "name": "ByteWrapper",
    "type": "record",
    "fields": [{
        "name": "data",
        "type": "bytes"
    }]
}

"configure which kafka header to write to parquet"

None. Kafka headers are just bytes, and have no designated serialization format.

To use ParquetFormat, you'll need to change your producer to use KafkaAvroSerializer from Confluent and also use their AvroConverter in Connect to deserialize that data prior to writing Parquet.
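
For example, here is a minimal sketch of such a producer using Confluent's KafkaAvroSerializer with the ByteWrapper schema above. The Schema Registry URL (http://my-schema-registry:8081) is an assumption; point it at your own registry.

import java.nio.ByteBuffer;
import java.util.Properties;
import java.util.Random;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.LongSerializer;

import io.confluent.kafka.serializers.KafkaAvroSerializer;

// The ByteWrapper record schema from above, wrapping the payload in a "data" field
Schema schema = new Schema.Parser().parse(
        "{\"namespace\": \"com.stackoverflow.example\", \"name\": \"ByteWrapper\", "
      + "\"type\": \"record\", \"fields\": [{\"name\": \"data\", \"type\": \"bytes\"}]}");

Properties props = new Properties();
props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "myKafka:9092");
props.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class.getName());
props.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
// Assumed Schema Registry address -- replace with your own
props.setProperty("schema.registry.url", "http://my-schema-registry:8081");

try (KafkaProducer<Long, GenericRecord> producer = new KafkaProducer<>(props)) {
    Random rng = new Random();

    GenericRecord value = new GenericData.Record(schema);
    // Avro "bytes" fields take a java.nio.ByteBuffer
    value.put("data", ByteBuffer.wrap(new byte[]{1, 2, 3}));

    producer.send(new ProducerRecord<>("s3.test.topic", rng.nextLong(), value));
    producer.flush();
}

On the Connect side, you would then set "value.converter": "io.confluent.connect.avro.AvroConverter" and "value.converter.schema.registry.url" to the same registry, so the sink can rebuild the Avro record before ParquetFormat writes it.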


If you want to write only the raw bytes to S3, use ByteArrayFormat instead.
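
For reference, a minimal sketch of your connector config switched to ByteArrayFormat; the connector name test_2_s3_bytes is just a placeholder, and the substantive changes from your config above are format.class plus dropping parquet.codec:

{
    "name": "test_2_s3_bytes",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.bytearray.ByteArrayFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
        "s3.bucket.name": "test-phase1",
        "s3.region": "eu-central-1",
        "flush.size": "5",
        "tasks.max": "1",
        "topics": "s3.test.topic",
        "key.converter": "org.apache.kafka.connect.converters.LongConverter",
        "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter"
    }
}

With ByteArrayFormat the record values are written out as-is, so no Avro schema is required.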
