How to implement custom deserializer for Kafka stream using Spark structured streaming?

I'm trying to migrate my current streaming app, which is based on RDDs (from their documentation), to the new Datasets API using Structured Streaming, which I'm told is the preferred approach to real-time streaming with Spark these days.

Currently I have the app set up to consume from one topic called "SATELLITE", whose messages have a timestamp key and a value containing a Satellite POJO. But I'm having trouble figuring out how to implement a deserializer for this. In my current app it is easy: you just add a line to your Kafka properties map, like kafkaParams.put("value.deserializer", SatelliteMessageDeserializer.class);. I'm doing this in Java, which presents the biggest challenge, because all the solutions appear to be in Scala, which I don't understand well, and I'm not easily able to convert Scala code to Java code.
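For reference, the relevant part of the current RDD-based setup looks something like this (a sketch; the group id is a placeholder and the rest of the direct-stream wiring is omitted):

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.serialization.StringDeserializer;

Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", bootstrapServers);
kafkaParams.put("group.id", "satellite-consumer");  // placeholder group id
kafkaParams.put("key.deserializer", StringDeserializer.class);  // assuming string keys; adjust for the timestamp key format
kafkaParams.put("value.deserializer", SatelliteMessageDeserializer.class);  // the custom deserializer plugs in directly here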

I followed an example for JSON outlined in this question, which currently works, but seems overly complex for what I need to do. Given that I already have a custom deserializer made for this purpose, I don't see why I should have to cast the value to a string first, only to convert it to JSON and then to my desired class type. I've also been trying to use some of the examples I found here, but I've had no luck so far.

Currently my app looks like this (using the JSON approach):

import common.model.Satellite;
import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQueryException;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class SparkStructuredStreaming implements Runnable {

    private String bootstrapServers;
    private SparkSession session;

    public SparkStructuredStreaming(final String bootstrapServers, final SparkSession session) {
        this.bootstrapServers = bootstrapServers;
        this.session = session;
    }
    @Override
    public void run() {
        Dataset<Row> df = session
                .readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", bootstrapServers)
                .option("subscribe", "SATELLITE")
                .load();

        StructType schema =  DataTypes.createStructType(new StructField[] {
                DataTypes.createStructField("id", DataTypes.StringType, true),
                DataTypes.createStructField("gms", DataTypes.StringType, true),
                DataTypes.createStructField("satelliteId", DataTypes.StringType, true),
                DataTypes.createStructField("signalId", DataTypes.StringType, true),
                DataTypes.createStructField("cnr", DataTypes.DoubleType, true),
                DataTypes.createStructField("constellation", DataTypes.StringType, true),
                DataTypes.createStructField("timestamp", DataTypes.TimestampType, true),
                DataTypes.createStructField("mountPoint", DataTypes.StringType, true),
                DataTypes.createStructField("pseudorange", DataTypes.DoubleType, true),
                DataTypes.createStructField("epochTime", DataTypes.IntegerType, true)
        });

        Dataset<Satellite> df1 = df.selectExpr("CAST(value AS STRING) as message")
                .select(functions.from_json(functions.col("message"), schema).as("json"))
                .select("json.*")
                .as(Encoders.bean(Satellite.class));

        try {
            df1.writeStream()
                    .format("console")
                    .option("truncate", "false")
                    .start()
                    .awaitTermination();

        } catch (StreamingQueryException e) {
            e.printStackTrace();
        }
    }
}

And I have a custom deserializer that looks like this:

import common.model.Satellite;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.kafka.common.serialization.Deserializer;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.nio.charset.StandardCharsets;
import java.util.Map;

public class SatelliteMessageDeserializer implements Deserializer<Satellite> {

    private static Logger logger = LoggerFactory.getLogger(SatelliteMessageDeserializer.class);
    private ObjectMapper objectMapper = new ObjectMapper();

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
    }

    @Override
    public void close() {
    }

    @Override
    public Satellite deserialize(String topic, byte[] data) {
        try {
            return objectMapper.readValue(new String(data, StandardCharsets.UTF_8), getMessageClass());
        } catch (Exception e) {
            logger.error("Unable to deserialize message {}", data, e);
            return null;
        }
    }

    protected Class<Satellite> getMessageClass() {
        return Satellite.class;
    }
}

How can I use my custom deserializer from within the SparkStructuredStreaming class? I am using Spark 2.4, OpenJDK 10, and Kafka 2.0.

EDIT: I've tried creating my own UDF, which I think is how this is supposed to be done, but I'm not sure how to get it to return a specific type, as it only seems to allow me to use the types in the DataTypes class!

UserDefinedFunction mode = udf(
        (byte[] bytes) -> deserializer.deserialize("", bytes), DataTypes.BinaryType // needs to be type Satellite, but only allows types from the DataTypes class
);

Dataset<Row> df1 = df.select(mode.apply(col("value")));

from_json can only work on string-typed Columns.

Structured Streaming always consumes the Kafka values as bytes

Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values

Therefore, you would at least have to deserialize to a String first, but I don't think you really need that.

It might be possible to just do this:

df.select("value").as(Encoders.bean(Satellite.class))
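If the bean encoder can't map the single binary value column directly, a more explicit variant along the same lines (just a sketch, not tested) is to read the column as bytes and map it through your existing deserializer with the bean encoder:

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;

// Read the raw Kafka value as bytes, then run it through the custom deserializer
Dataset<Satellite> satellites = df
        .select("value")
        .as(Encoders.BINARY())      // Dataset<byte[]>
        .map((MapFunction<byte[], Satellite>) bytes ->
                        new SatelliteMessageDeserializer().deserialize("SATELLITE", bytes),
                Encoders.bean(Satellite.class));
// Note: creating a deserializer per record is wasteful; in practice keep it in a
// static field so each executor JVM reuses a single instance.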

If that doesn't work, what you could try is defining your own UDF/decoder so that you could have something like SATELLITE_DECODE(value).

In Scala:

object SatelliteDeserializerWrapper {
    val deser = new SatelliteDeserializer
}
spark.udf.register("SATELLITE_DECODE", (topic: String, bytes: Array[Byte]) => 
    SatelliteDeserializerWrapper.deser.deserialize(topic, bytes)
)

df.selectExpr("""SATELLITE_DECODE("topic1", value) AS message""")
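Since the question is in Java, a rough Java translation might look like the following. This is a sketch, not tested: a Java UDF cannot declare Satellite itself as the return type, so it returns a Row laid out to match the StructType already built in the question, and the getter names on Satellite are assumed to be standard bean accessors (the SatelliteUdfRegistrar class name is hypothetical):

import common.model.Satellite;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF2;
import org.apache.spark.sql.types.StructType;

public class SatelliteUdfRegistrar {

    // One deserializer per executor JVM, mirroring the Scala wrapper object above
    private static final SatelliteMessageDeserializer DESERIALIZER =
            new SatelliteMessageDeserializer();

    public static void register(SparkSession session, StructType schema) {
        session.udf().register("SATELLITE_DECODE",
                (UDF2<String, byte[], Row>) (topic, bytes) -> {
                    Satellite s = DESERIALIZER.deserialize(topic, bytes);
                    if (s == null) {
                        return null; // deserialization failed; emit a null struct
                    }
                    // Field order must match the StructType from the question;
                    // getters are assumed standard bean accessors
                    return RowFactory.create(
                            s.getId(), s.getGms(), s.getSatelliteId(), s.getSignalId(),
                            s.getCnr(), s.getConstellation(), s.getTimestamp(),
                            s.getMountPoint(), s.getPseudorange(), s.getEpochTime());
                },
                schema); // the UDF's declared return type is the struct, not Satellite
    }
}

After calling SatelliteUdfRegistrar.register(session, schema), the stream side becomes df.selectExpr("SATELLITE_DECODE('SATELLITE', value) AS message"), just as in the Scala version.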

See this post for inspiration; the approach is also mentioned in the Databricks blog.
