I'm trying to migrate my current streaming app, which is based on RDDs (following the Spark documentation), to the new Datasets API using Structured Streaming, which I'm told is the preferred approach for real-time streaming with Spark these days.
Currently I have the app set up to consume from one topic called "SATELLITE", whose messages have a timestamp key and a value containing a Satellite POJO. But I'm having trouble figuring out how to implement a deserializer for this. In my current app it is easy: you just add a line to your Kafka properties map: kafkaParams.put("value.deserializer", SatelliteMessageDeserializer.class);
I'm doing this in Java, which is presenting the biggest challenge, because all the solutions appear to be in Scala, which I don't understand well and I'm not easily able to convert Scala code to Java code.
I followed an example for JSON outlined in this question, which currently works, but it seems overly complex for what I need to do. Given that I already have a custom deserializer made for this purpose, I don't see why I should have to cast the value to a string first, only to convert it to JSON, and then convert it to my desired class type. I've also been trying to use some of the examples I found here, but I've had no luck so far.
Currently my app looks like this (using the JSON approach):
import common.model.Satellite;
import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkStructuredStreaming implements Runnable {

    private String bootstrapServers;
    private SparkSession session;

    public SparkStructuredStreaming(final String bootstrapServers, final SparkSession session) {
        this.bootstrapServers = bootstrapServers;
        this.session = session;
    }

    @Override
    public void run() {
        Dataset<Row> df = session
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", bootstrapServers)
                .option("subscribe", "SATELLITE")
                .load();

        StructType schema = DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("id", DataTypes.StringType, true),
                DataTypes.createStructField("gms", DataTypes.StringType, true),
                DataTypes.createStructField("satelliteId", DataTypes.StringType, true),
                DataTypes.createStructField("signalId", DataTypes.StringType, true),
                DataTypes.createStructField("cnr", DataTypes.DoubleType, true),
                DataTypes.createStructField("constellation", DataTypes.StringType, true),
                DataTypes.createStructField("timestamp", DataTypes.TimestampType, true),
                DataTypes.createStructField("mountPoint", DataTypes.StringType, true),
                DataTypes.createStructField("pseudorange", DataTypes.DoubleType, true),
                DataTypes.createStructField("epochTime", DataTypes.IntegerType, true)
        });

        Dataset<Satellite> df1 = df.selectExpr("CAST(value AS STRING) as message")
                .select(functions.from_json(functions.col("message"), schema).as("json"))
                .select("json.*")
                .as(Encoders.bean(Satellite.class));

        try {
            df1.writeStream()
                    .format("console")
                    .option("truncate", "false")
                    .start()
                    .awaitTermination();
        } catch (StreamingQueryException e) {
            e.printStackTrace();
        }
    }
}
And I have a custom deserializer that looks like this:
import common.model.Satellite;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Deserializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import java.util.Map;

public class SatelliteMessageDeserializer implements Deserializer<Satellite> {

    private static Logger logger = LoggerFactory.getLogger(SatelliteMessageDeserializer.class);
    private ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public void configure(Map configs, boolean isKey) {
    }

    @Override
    public void close() {
    }

    @Override
    public Satellite deserialize(String topic, byte[] data) {
        try {
            return objectMapper.readValue(new String(data, "UTF-8"), getMessageClass());
        } catch (Exception e) {
            logger.error("Unable to deserialize message {}", data, e);
            return null;
        }
    }

    protected Class<Satellite> getMessageClass() {
        return Satellite.class;
    }
}
How can I use my custom deserializer from within the SparkStructuredStreaming class? I am using Spark 2.4, OpenJDK 10 and Kafka 2.0.
EDIT: I've tried creating my own UDF, which I think is how this is supposed to be done, but I'm not sure how to get it to return a specific type, as it only seems to allow me to use the types in the DataTypes class!
UserDefinedFunction mode = udf(
        (byte[] bytes) -> deserializer.deserialize("", bytes),
        DataTypes.BinaryType // Needs to be type Satellite, but only allows ones of type DataTypes
);
Dataset df1 = df.select(mode.apply(col("value")));
from_json can only work on string-typed columns.
Structured Streaming always consumes the Kafka values as bytes. From the documentation:
Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.
Therefore, you would at least be deserializing to a String first, but I don't think you really need that.
It might be possible to just do this
df.select(value).as(Encoders.bean(Satellite.class))
If that doesn't work, you could try defining your own UDF/decoder so that you can write something like SATELLITE_DECODE(value).
In Scala:
// Holds a single deserializer instance per JVM; the class name matches the one in the question.
object SatelliteDeserializerWrapper {
  val deser = new SatelliteMessageDeserializer
}

spark.udf.register("SATELLITE_DECODE", (topic: String, bytes: Array[Byte]) =>
  SatelliteDeserializerWrapper.deser.deserialize(topic, bytes)
)

df.selectExpr("""SATELLITE_DECODE("topic1", value) AS message""")
See this post for inspiration; the same approach is also mentioned in a Databricks blog post.
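Since the question asks for Java, here is a rough sketch of the same idea without a UDF: read the value column as a Dataset<byte[]> and run the existing deserializer inside mapPartitions. PartitionDecoder is a hypothetical helper name, not a Spark class, and the stand-in decoder in main only exists so the block runs without Spark. The Spark wiring is left as a comment because it needs spark-sql plus the Satellite and SatelliteMessageDeserializer classes from the question, and it assumes Satellite is a bean that Encoders.bean can handle (the question's JSON version already relies on that).

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/**
 * Hypothetical helper: applies a byte[] -> T decoder to every record of one
 * partition and drops records that are null or that the decoder rejected
 * (the deserializer in the question returns null on failure).
 */
final class PartitionDecoder {

    private PartitionDecoder() {}

    static <T> Iterator<T> decode(Iterator<byte[]> records, Function<byte[], T> decoder) {
        // Buffers the decoded records of one partition in memory.
        List<T> out = new ArrayList<>();
        while (records.hasNext()) {
            byte[] bytes = records.next();
            if (bytes == null) {
                continue; // Kafka values can be null
            }
            T value = decoder.apply(bytes);
            if (value != null) {
                out.add(value);
            }
        }
        return out.iterator();
    }

    // Assumed Spark wiring (not compiled here; df is the Kafka source Dataset<Row>):
    //
    //   Dataset<Satellite> satellites = df
    //       .select("value")
    //       .as(Encoders.BINARY())
    //       .mapPartitions((MapPartitionsFunction<byte[], Satellite>) it -> {
    //           // Created on the executor, so the deserializer need not be Serializable.
    //           SatelliteMessageDeserializer d = new SatelliteMessageDeserializer();
    //           return PartitionDecoder.decode(it, b -> d.deserialize("SATELLITE", b));
    //       }, Encoders.bean(Satellite.class));

    public static void main(String[] args) {
        // Stand-in decoder: UTF-8 text, with empty payloads treated as failures.
        Function<byte[], String> decoder =
                b -> b.length == 0 ? null : new String(b, StandardCharsets.UTF_8);

        List<byte[]> records = Arrays.asList(
                "G01".getBytes(StandardCharsets.UTF_8),
                new byte[0],
                null,
                "E11".getBytes(StandardCharsets.UTF_8));

        Iterator<String> decoded = decode(records.iterator(), decoder);
        while (decoded.hasNext()) {
            System.out.println(decoded.next());
        }
    }
}
```

Creating the deserializer inside the mapPartitions lambda avoids shipping it from the driver, which plays the same role as the wrapper object in the Scala version. Whether Encoders.bean accepts every field of Satellite (the timestamp in particular) depends on the bean's field types, so treat that part as something to verify.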