
How to write Parquet files with Kafka headers

This is my first Spark program. I would like to consume Kafka messages. The messages contain a byte array, some Kafka headers, and the key. The required output is Parquet files with the columns (kafkaKey, kafkaHeader1, kafkaHeader2, byteArr). I tried to implement it with Spark. Any idea how I add the schema, and is the schema correct? Currently I can't control how the output will look.

...
 SparkSession spark = SparkSession
                .builder()
                .appName("Spark Kafka")
                .master("local")
                .getOrCreate();
...

Is this the correct way to create the schema?

        StructType rfSchema = new StructType(new StructField[]{
                new StructField("kafkaHeader1", DataTypes.StringType, false, Metadata.empty()),
                new StructField("kafkaHeader2", DataTypes.StringType, false, Metadata.empty()),
                new StructField("key", DataTypes.LongType, false, Metadata.empty()),
            }
        );


        Dataset<Row> ds = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "10.0.0.0:30526")
                .option("subscribe", "test.topic")
                .option("includeHeaders", "true")
                .option("max.poll.records", "4000")
                .option("group.id", "testSpark")
                .option("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer")
                .option("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
                .option("startingOffsets", "earliest")
                .option("failOnDataLoss", "false")
                .load();

...
        // I saw this line in many examples, why do I need it?
        ds.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers");

        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        String currentDate= format.format(new Date());

        ds.printSchema();
        ds.writeStream()
                .option("checkpointLocation", "/home/xxx/spark3/streamingCheckpoint")
                .format("parquet")
                .outputMode(OutputMode.Append())
                .partitionBy("partition")
                .start("home/xxx/spark3/"+currentDate);
        try {
            Thread.sleep(400000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }

    }
...
Thanks

I saw this line in many examples, why do I need it?

Because, by default, Spark doesn't deserialize your Kafka data.

You need to use a built-in function or a UDF (such as CAST) to parse the value (and optionally the key and headers).

Only after casting/parsing the data into a Spark StructType, for example, will you be able to write it to a structured format such as Parquet.
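
For example, a minimal sketch of that step in Java, building on the ds above (the aliases kafkaKey/byteArr are only illustrative, and CAST(key AS STRING) assumes the key bytes are UTF-8 text; a LongSerializer-encoded key would need a UDF to decode):

// Sketch: select and parse the columns before writing.
// key/value arrive as binary; headers is array<struct<key: string, value: binary>>.
Dataset<Row> parsed = ds.selectExpr(
        "CAST(key AS STRING) AS kafkaKey",   // only meaningful if the key bytes are text
        "value AS byteArr",                  // keep the raw byte array column as-is
        "headers");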

Parquet should be able to hold arrays, by the way, so if you want all the headers rather than only two, use an ArrayType schema.
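
If you do want individual header columns rather than the whole array, one possible approach (a sketch, assuming Spark 2.4+ for map_from_entries/element_at, unique header keys, and text header values) is:

// Sketch: flatten two named headers from the headers array into string columns.
// Header values are binary, so they are cast to STRING here.
Dataset<Row> flattened = ds.selectExpr(
        "CAST(element_at(map_from_entries(headers), 'kafkaHeader1') AS STRING) AS kafkaHeader1",
        "CAST(element_at(map_from_entries(headers), 'kafkaHeader2') AS STRING) AS kafkaHeader2",
        "value AS byteArr");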

That being said, start with this.

ds.selectExpr("CAST(key AS LONG)", "headers")
    .writeStream
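
Put together with the Parquet sink, a rough Java sketch could look like this (paths are placeholders, the key is cast to STRING as in the docs example, and start()/awaitTermination() throw checked exceptions you need to declare or handle):

// Rough sketch: parse the Kafka columns, then stream them out as Parquet.
ds.selectExpr("CAST(key AS STRING) AS kafkaKey", "value AS byteArr", "headers")
        .writeStream()
        .format("parquet")
        .outputMode(OutputMode.Append())
        .option("checkpointLocation", "/tmp/spark-checkpoint")  // placeholder path
        .start("/tmp/spark-output")                              // placeholder path
        .awaitTermination();  // block here instead of Thread.sleep(...)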

Setting the deserializer options is not valid.

From the docs, for values...

Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values

And see the section that says "Each row in the source has the following schema" for the data types of the remaining columns.
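
As a hedged sketch, the same source definition without the options that the Structured Streaming Kafka source does not use could look like this (maxOffsetsPerTrigger is an assumption for what max.poll.records was meant to achieve, a per-trigger rate limit):

// Sketch: drop the deserializer/group.id/max.poll.records options; Spark manages
// deserialization and the consumer group itself, and consumer properties would
// need the "kafka." prefix anyway.
Dataset<Row> source = spark
        .readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "10.0.0.0:30526")
        .option("subscribe", "test.topic")
        .option("includeHeaders", "true")
        .option("startingOffsets", "earliest")
        .option("maxOffsetsPerTrigger", 4000)   // assumption: replaces the max.poll.records intent
        .option("failOnDataLoss", "false")
        .load();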
