This is my first Spark program. I would like to consume Kafka messages. Each message contains a byte array, some Kafka headers, and a key. The required output is Parquet files with the columns (kafkaKey, kafkaHeader1, kafkaHeader2, byteArr). I tried to implement it with Spark. How do I add the schema, and is my schema correct? Currently I can't control what the output looks like.
...
SparkSession spark = SparkSession
.builder()
.appName("Spark Kafka")
.master("local")
.getOrCreate();
...
Is this the way to create the schema?
StructType rfSchema = new StructType(new StructField[]{
        new StructField("kafkaHeader1", DataTypes.StringType, false, Metadata.empty()),
        new StructField("kafkaHeader2", DataTypes.StringType, false, Metadata.empty()),
        new StructField("key", DataTypes.LongType, false, Metadata.empty())
});
Dataset<Row> ds = spark
.readStream()
.format("kafka")
.option("kafka.bootstrap.servers", "10.0.0.0:30526")
.option("subscribe", "test.topic")
.option("includeHeaders", "true")
.option("max.poll.records", "4000")
.option("group.id", "testSpark")
.option("key.deserializer", "org.apache.kafka.common.serialization.LongDeserializer")
.option("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
.option("startingOffsets", "earliest")
.option("failOnDataLoss", "false")
.load();
// I saw this line in many examples — why do I need it?
// ds.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers");
SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
String currentDate= format.format(new Date());
ds.printSchema();
ds.writeStream()
.option("checkpointLocation", "/home/xxx/spark3/streamingCheckpoint")
.format("parquet")
.outputMode(OutputMode.Append())
.partitionBy("partition")
.start("/home/xxx/spark3/" + currentDate);
try {
Thread.sleep(400000);
} catch (InterruptedException e) {
throw new RuntimeException(e);
}
}
...
Thanks
> I saw this line in many examples, why do I need it?
Because, by default, Spark doesn't deserialize your Kafka data. You need to use an expression (such as CAST, which is a built-in function, not a UDF) to parse the value (and optionally the key and headers). Only after casting/parsing the data into a Spark StructType, for example, will you be able to write it to a structured format such as Parquet.
Parquet can hold arrays, by the way, so if you want all the headers rather than only two, use an ArrayType schema.
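You don't have to define that ArrayType yourself: when `includeHeaders` is true, the Kafka source already exposes `headers` as an array of (key, value) structs, which Parquet can store as-is. A minimal illustration (column aliases are taken from the question):

```java
// headers is array<struct<key: string, value: binary>> in the Kafka source schema.
// Keeping the whole array preserves every header, not just two named ones.
ds.selectExpr("CAST(key AS STRING) AS kafkaKey", "value AS byteArr", "headers")
  .printSchema();
```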
That being said, start with this:

ds.selectExpr("CAST(key AS LONG)", "headers")
  .writeStream
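Putting that together for the exact columns the question asks for — a sketch only, assuming Spark 2.4+ for the `filter` higher-order function, and using the header names and paths from the question. Note that if the producer wrote the key with a `LongSerializer`, the key is 8 raw bytes and `CAST(key AS STRING)` won't give a readable number; you'd need to decode it yourself.

```java
// Sketch: picks out two named headers by key and keeps the raw value bytes.
// filter(headers, h -> h.key = '...') returns the matching header structs (Spark 2.4+).
Dataset<Row> parsed = ds.selectExpr(
        "CAST(key AS STRING) AS kafkaKey",
        "CAST(filter(headers, h -> h.key = 'kafkaHeader1')[0].value AS STRING) AS kafkaHeader1",
        "CAST(filter(headers, h -> h.key = 'kafkaHeader2')[0].value AS STRING) AS kafkaHeader2",
        "value AS byteArr");

parsed.writeStream()
        .format("parquet")
        .option("checkpointLocation", "/home/xxx/spark3/streamingCheckpoint")
        .outputMode(OutputMode.Append())
        .start("/home/xxx/spark3/output")
        .awaitTermination();  // blocks until the query stops; declare or catch
                              // StreamingQueryException — more reliable than Thread.sleep
```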
Setting the deserializer options is not valid — the Spark Kafka source ignores them.
From the docs, for values:

> Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.

And see the section that says "Each row in the source has the following schema" for the data types of the rest.
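As I read those docs, that source schema looks like the following; verify it against your own stream with printSchema():

```java
// Documented Kafka source schema (with includeHeaders=true) — verify on your build:
ds.printSchema();
// root
//  |-- key: binary
//  |-- value: binary
//  |-- topic: string
//  |-- partition: integer
//  |-- offset: long
//  |-- timestamp: timestamp
//  |-- timestampType: integer
//  |-- headers: array of struct<key: string, value: binary>
```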