How to deserialize records from Kafka using Structured Streaming in Java?
I use Spark 2.1.
I am trying to read records from Kafka using Spark Structured Streaming, deserialize them, and apply aggregations afterwards.
I have the following code:
SparkSession spark = SparkSession
    .builder()
    .appName("Statistics")
    .getOrCreate();

Dataset<Row> df = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaUri)
    .option("subscribe", "Statistics")
    .option("startingOffsets", "earliest")
    .load();
df.selectExpr("CAST(value AS STRING)")
What I want is to deserialize the value field into my object instead of casting it as a String.
I have a custom deserializer for this:
public StatisticsRecord deserialize(String s, byte[] bytes)
How can I do this in Java?
The only relevant link I have found is https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html, but it is for Scala.
Define a schema for your JSON messages.
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

StructType schema = DataTypes.createStructType(new StructField[] {
    DataTypes.createStructField("Id", DataTypes.IntegerType, false),
    DataTypes.createStructField("Name", DataTypes.StringType, false),
    DataTypes.createStructField("DOB", DataTypes.DateType, false) });
Now read the messages as below. MessageData is a JavaBean for your JSON message.
Dataset<MessageData> df = spark
    .readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", kafkaUri)
    .option("subscribe", "Statistics")
    .option("startingOffsets", "earliest")
    .load()
    .selectExpr("CAST(value AS STRING) as message")
    .select(functions.from_json(functions.col("message"), schema).as("json"))
    .select("json.*")
    .as(Encoders.bean(MessageData.class));
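For completeness, here is a minimal sketch of what MessageData might look like (the class itself is not shown in the original answer; the fields are assumed to follow the schema above). Spark's bean encoder needs a public no-arg constructor plus getters and setters:

import java.io.Serializable;
import java.sql.Date;

public class MessageData implements Serializable {
    private Integer id;   // matches "Id" (Spark resolves column names case-insensitively by default)
    private String name;  // matches "Name"
    private Date dob;     // matches "DOB"

    public MessageData() {}

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Date getDob() { return dob; }
    public void setDob(Date dob) { this.dob = dob; }
}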
If you have a custom deserializer in Java for your data, use it on the bytes you get from Kafka after load:
df.select("value")
That line gives you a Dataset<Row> with just a single value column.
I work exclusively with the Spark API for Scala, so I'd do the following in Scala to handle the "deserialization" case:
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.functions.udf
import spark.implicits._  // for the $"value" column syntax

implicit val statisticsRecordEncoder = Encoders.product[StatisticsRecord]
// wrap the custom deserializer in a UDF (the parameter type annotation is required)
val myDeserializerUDF = udf { bytes: Array[Byte] => deserialize("hello", bytes) }
df.select(myDeserializerUDF($"value") as "value_des")
That should give you what you want... in Scala. Converting it to Java is your homework exercise :)
Mind that your custom object has to have an encoder available, or Spark SQL will refuse to put its objects inside a Dataset.
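Since the question asks for Java: one way to get the same effect there is Dataset.map with an explicit encoder, rather than a UDF. This is only a sketch under the question's assumptions (StatisticsRecord is a JavaBean, deserialize is your custom method, and the first String argument is a placeholder just like "hello" above):

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

Dataset<StatisticsRecord> records = df
    .select("value")            // the raw Kafka payload column
    .as(Encoders.BINARY())      // Dataset<byte[]>
    .map((MapFunction<byte[], StatisticsRecord>) bytes ->
            deserialize("Statistics", bytes),  // "Statistics" is a placeholder first argument
        Encoders.bean(StatisticsRecord.class));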