
How to write parquet files with kafka headers

This is my first Spark program. I would like to consume Kafka messages. Each message contains a byte array, some Kafka headers, and the key. The required output is Parquet files with the columns (kafkaKey, kafkaHeader1, kafkaHeader2, byteArr). I tried to implement it with Spark. Any idea how I add the schema, and is the schema correct? Currently I can't control how the output looks.

...
 SparkSession spark = SparkSession
                .builder()
                .appName("Spark Kafka")
                .master("local")
                .getOrCreate();
...

Is this the way to create the schema?

        StructType rfSchema = new StructType(new StructField[]{
                new StructField("kafkaHeader1", DataTypes.StringType, false, Metadata.empty()),
                new StructField("kafkaHeader2", DataTypes.StringType, false, Metadata.empty()),
                new StructField("key", DataTypes.LongType, false, Metadata.empty()),
            }
        );


        Dataset<Row> ds = spark
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "10.0.0.0:30526")
                .option("subscribe", "test.topic")
                .option("includeHeaders", "true")
                .option("max.poll.records", "4000")
                .option("group.id", "testSpark")
                .option("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer")
                .option("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
                .option("startingOffsets", "earliest")
                .option("failOnDataLoss", "false")
                .load();

... // I saw this line in many examples, why do I need it?
... ds.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers");

        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        String currentDate= format.format(new Date());

        ds.printSchema();
        ds.writeStream()
                .option("checkpointLocation", "/home/xxx/spark3/streamingCheckpoint")
                .format("parquet")
                .outputMode(OutputMode.Append())
                .partitionBy("partition")
                .start("/home/xxx/spark3/" + currentDate);
        try {
            Thread.sleep(400000);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }

    }
...
Thanks

saw this line in many examples, why do I need it?

Because, by default, Spark doesn't deserialize your Kafka data.

You need to use SQL functions (such as CAST) to parse the value (and optionally the key and headers).

Only after casting/parsing the data into a Spark StructType, for example, will you be able to write to a structured format, such as Parquet.
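As a sketch only (not from the original post), here is one way that projection could look, assuming Spark 3.0+ with includeHeaders=true, where the source's headers column is array&lt;struct&lt;key:string,value:binary&gt;&gt;; the output column names come from the question, and the header names are assumptions:

```java
// Sketch: project the raw Kafka columns into the requested output columns.
Dataset<Row> projected = ds.selectExpr(
        "CAST(key AS LONG) AS kafkaKey",
        "value AS byteArr",   // keep the payload as raw bytes
        // filter() keeps header structs whose key matches, element_at(..., 1)
        // takes the first match, and .value is its binary payload.
        "CAST(element_at(filter(headers, h -> h.key = 'kafkaHeader1'), 1).value AS STRING) AS kafkaHeader1",
        "CAST(element_at(filter(headers, h -> h.key = 'kafkaHeader2'), 1).value AS STRING) AS kafkaHeader2");
```

This fragment assumes the ds variable from the question; filter and element_at are Spark SQL built-in higher-order functions.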

Parquet should be able to hold arrays, by the way, so if you want all the headers rather than only two, use an ArrayType schema.
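For instance (a sketch under the same Spark 3.0+ / includeHeaders=true assumption), the whole headers array can be kept as one nested column instead of flattening it:

```java
// Sketch: keep every Kafka header. The source already exposes `headers` as
// array<struct<key:string,value:binary>>, and Parquet can store that nested
// array column as-is.
Dataset<Row> withAllHeaders = ds.selectExpr(
        "CAST(key AS LONG) AS kafkaKey",
        "value AS byteArr",
        "headers");   // written to Parquet unchanged, no per-header columns needed
```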

That being said, start with this.

ds.selectExpr("CAST(key AS LONG)", "headers")
    .writeStream()

Setting the deserializer options is not valid.

From the docs, for values...

Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
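To illustrate what that deserialization amounts to, here is a plain-Java sketch outside Spark: the source hands you raw bytes, and CAST(value AS STRING) is essentially a UTF-8 decode of them.

```java
import java.nio.charset.StandardCharsets;

public class CastSketch {
    // Roughly what CAST(value AS STRING) does to a UTF-8 payload:
    // decode the raw bytes the Kafka source delivered.
    static String castBinaryToString(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] raw = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(castBinaryToString(raw)); // prints "hello"
    }
}
```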

And see the section that says "Each row in the source has the following schema" for the data types of the rest.
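For reference, ds.printSchema() on the Kafka source with headers enabled should print roughly the following; this reflects the documented source schema, not output from the question's environment:

```
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)
 |-- headers: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- value: binary (nullable = true)
```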
