
Kafka Connect -> S3 Parquet file from byte array

I'm trying to use Kafka Connect to consume Kafka messages and write them to S3 as Parquet files, so I wrote a simple producer which produces messages with a byte[] value:

import java.util.Properties;
import java.util.Random;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Headers;
import org.apache.kafka.common.header.internals.RecordHeader;
import org.apache.kafka.common.header.internals.RecordHeaders;
import org.apache.kafka.common.serialization.ByteArraySerializer;
import org.apache.kafka.common.serialization.LongSerializer;

Properties propertiesAWS = new Properties();
propertiesAWS.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "myKafka:9092");
propertiesAWS.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class.getName());
propertiesAWS.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());

KafkaProducer<Long, byte[]> producer = new KafkaProducer<>(propertiesAWS);
Random rng = new Random();

for (int i = 0; i < 100; i++) {
    try {
        Thread.sleep(1000);
        Headers headers = new RecordHeaders();
        headers.add(new RecordHeader("header1", "header1".getBytes()));
        headers.add(new RecordHeader("header2", "header2".getBytes()));
        // topic, partition, key, value, headers
        ProducerRecord<Long, byte[]> recordOut = new ProducerRecord<>(
                "s3.test.topic", 1, rng.nextLong(), new byte[]{1, 2, 3}, headers);
        producer.send(recordOut);
    } catch (Exception e) {
        System.out.println(e);
    }
}

producer.flush();
producer.close();

And my Kafka Connect configuration is:

{
    "name": "test_2_s3",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "aws.access.key.id": "XXXXXXX",
        "aws.secret.access.key": "XXXXXXXX",
        "s3.region": "eu-central-1",
        "flush.size": "5",
        "rotate.schedule.interval.ms": "10000",
        "timezone": "UTC",
        "tasks.max": "1",
        "topics": "s3.test.topic",
        "parquet.codec": "gzip",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "s3.bucket.name": "test-phase1",
        "key.converter": "org.apache.kafka.connect.converters.LongConverter",
        "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
        "behavior.on.null.values": "ignore",
        "store.kafka.headers": "true"
    }
}

And this is the error I got:

Caused by: java.lang.IllegalArgumentException: Avro schema must be a record. at org.apache.parquet.avro.AvroSchemaConverter.convert(AvroSchemaConverter.java:124)

Where is my mistake? Do I need to use Avro even though I just want to write the byte array plus some Kafka headers? How do I configure which Kafka headers get written to Parquet? Thanks.

Yes, the ParquetFormat writer requires Avro data with a schema of type record.

You could use a schema like this, though.

{
    "namespace": "com.stackoverflow.example",
    "name": "ByteWrapper",
    "type": "record",
    "fields": [{
        "name": "data",
        "type": "bytes"
    }]
}

"configure which kafka header to write to parquet"

None. Kafka headers are just bytes, and have no designated serialization format.

To use ParquetFormat, you'll need to change your producer to use KafkaAvroSerializer from Confluent and also use their AvroConverter in Connect to deserialize that data prior to writing Parquet.
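
For example, here is a minimal sketch of such a producer using Confluent's KafkaAvroSerializer with the ByteWrapper schema above. The Schema Registry URL (http://my-schema-registry:8081) is an assumption; point it at your own registry.

import java.nio.ByteBuffer;
import java.util.Properties;
import java.util.Random;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.LongSerializer;

import io.confluent.kafka.serializers.KafkaAvroSerializer;

// The ByteWrapper record schema from above, wrapping the payload in a "data" field
Schema schema = new Schema.Parser().parse(
        "{\"namespace\": \"com.stackoverflow.example\", \"name\": \"ByteWrapper\", "
      + "\"type\": \"record\", \"fields\": [{\"name\": \"data\", \"type\": \"bytes\"}]}");

Properties props = new Properties();
props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "myKafka:9092");
props.setProperty(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, LongSerializer.class.getName());
props.setProperty(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
// Assumed Schema Registry address -- replace with your own
props.setProperty("schema.registry.url", "http://my-schema-registry:8081");

try (KafkaProducer<Long, GenericRecord> producer = new KafkaProducer<>(props)) {
    Random rng = new Random();

    GenericRecord value = new GenericData.Record(schema);
    // Avro "bytes" fields take a java.nio.ByteBuffer
    value.put("data", ByteBuffer.wrap(new byte[]{1, 2, 3}));

    producer.send(new ProducerRecord<>("s3.test.topic", rng.nextLong(), value));
    producer.flush();
}

On the Connect side, you would then set "value.converter": "io.confluent.connect.avro.AvroConverter" and "value.converter.schema.registry.url" to the same registry, so the sink can rebuild the Avro record before ParquetFormat writes it.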


If you want to write only the raw bytes to S3, use ByteArrayFormat instead.
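
For reference, a minimal sketch of your connector config switched to ByteArrayFormat; the connector name test_2_s3_bytes is just a placeholder, and the substantive changes from your config above are format.class plus dropping parquet.codec:

{
    "name": "test_2_s3_bytes",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.bytearray.ByteArrayFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.DefaultPartitioner",
        "s3.bucket.name": "test-phase1",
        "s3.region": "eu-central-1",
        "flush.size": "5",
        "tasks.max": "1",
        "topics": "s3.test.topic",
        "key.converter": "org.apache.kafka.connect.converters.LongConverter",
        "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter"
    }
}

With ByteArrayFormat the record values are written out as-is, so no Avro schema is required.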
