
How to consume correctly from a Kafka topic with Java Spark Structured Streaming

I am new to Kafka-Spark streaming and am trying to implement the examples from the Spark documentation with a Protocol Buffers serializer/deserializer. So far I have followed the official tutorials:

https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
https://developers.google.com/protocol-buffers/docs/javatutorial

and now I am stuck on the following problem. This question may be similar to this post: How to deserialize records from Kafka using Structured Streaming in Java?

I have already successfully implemented the serializer, which writes the messages to the Kafka topic; for reference, a simplified sketch of it is shown below. Now the task is to consume the messages with Spark Structured Streaming using a custom deserializer.
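A sketch of the serializer (illustrative only; Person is the class generated by the protobuf compiler, and the empty configure/close overrides are required by older Kafka client versions):

import java.util.Map;

import org.apache.kafka.common.serialization.Serializer;

public class CustomSerializer implements Serializer<Person> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // nothing to configure
    }

    @Override
    public byte[] serialize(String topic, Person person) {
        // Generated protobuf messages serialize themselves to a byte array
        return person == null ? null : person.toByteArray();
    }

    @Override
    public void close() {
        // nothing to release
    }
}

And the custom deserializer: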

import org.apache.kafka.common.serialization.Deserializer;

public class CustomDeserializer implements Deserializer<Person> {

    @Override
    public Person deserialize(String topic, byte[] data) {
        try {
            // Parse the raw bytes with the parser generated by protoc
            return Person.parseFrom(data);
        } catch (Exception e) {
            // ToDo: proper error handling
            return null;
        }
    }
}


Dataset<Row> dataset = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", topic)
        .option("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        .option("value.deserializer", "de.myproject.CustomDeserializer")
        .load()
        .select("value");

dataset.writeStream()
        .format("console")
        .start()
        .awaitTermination();

But as output I still get the raw binary values:

-------------------------------------------
Batch: 0
-------------------------------------------
+--------------------+
|               value|
+--------------------+
|[08 AC BD BB 09 1...|
+--------------------+

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+
|               value|
+--------------------+
|[08 82 EF D8 08 1...|
+--------------------+

According to the tutorial, I just need to set the value.deserializer option to get a human-readable format:

.option("value.deserializer", "de.myproject.CustomDeserializer")

Did I miss something?

Did you miss this section of the documentation?

Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:

  • key.deserializer: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the keys.
  • value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.

You'll have to register a UDF that invokes your deserializer instead.

This is similar to Read protobuf kafka message using spark structured streaming.
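Putting that together, a minimal sketch of the UDF approach could look like the following. Person.parseFrom comes from the generated protobuf class in the question; the UDF name deserialize, the getName() accessor, and the topic name persons are assumptions for illustration:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class PersonStreamConsumer {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("protobuf-kafka-consumer")
                .getOrCreate();

        // Parse the raw Kafka value bytes with the generated protobuf
        // parser inside a UDF; getName() is a hypothetical accessor on
        // the generated Person class.
        spark.udf().register("deserialize",
                (UDF1<byte[], String>) bytes -> Person.parseFrom(bytes).getName(),
                DataTypes.StringType);

        Dataset<Row> dataset = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "persons")   // topic name assumed
                .load()
                // value arrives as BinaryType; deserialize it explicitly
                // with the UDF instead of setting value.deserializer
                .selectExpr("deserialize(value) AS name");

        dataset.writeStream()
                .format("console")
                .start()
                .awaitTermination();
    }
}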

You need to convert the bytes to the String datatype:

dataset.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

Then you can use functions.from_json(dataset.col("value"), schema) with a matching StructType to get back the actual DataFrame, as sketched below.
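A minimal sketch of that approach, reusing the dataset from the question (note this assumes the Kafka values were JSON strings rather than protobuf binaries; for protobuf, the UDF approach above applies, and the two-field schema here is hypothetical):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema for a JSON-encoded Person payload
StructType schema = new StructType()
        .add("name", DataTypes.StringType)
        .add("age", DataTypes.IntegerType);

Dataset<Row> persons = dataset
        .selectExpr("CAST(value AS STRING) AS json")          // bytes -> string
        .select(from_json(col("json"), schema).as("person"))  // string -> struct
        .select("person.*");                                  // flatten the struct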

Happy Coding :)
