
How to consume correctly from a Kafka topic with Java Spark Structured Streaming

I am new to Kafka-Spark streaming and am trying to implement the examples from the Spark documentation with a Protocol Buffers serializer/deserializer. So far I have followed the official tutorials:

https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html
https://developers.google.com/protocol-buffers/docs/javatutorial

and now I am stuck on the following problem. This question may be similar to this post: How to deserialize records from Kafka using Structured Streaming in Java?

I have already successfully implemented the serializer, which writes the messages to the Kafka topic; for reference, a simplified sketch of it is shown below. Now the task is to consume the messages with Spark Structured Streaming using a custom deserializer.
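A sketch of the serializer (illustrative only; Person is the class generated by the protobuf compiler, and the empty configure/close overrides are required by older Kafka client versions):

import java.util.Map;

import org.apache.kafka.common.serialization.Serializer;

public class CustomSerializer implements Serializer<Person> {

    @Override
    public void configure(Map<String, ?> configs, boolean isKey) {
        // nothing to configure
    }

    @Override
    public byte[] serialize(String topic, Person person) {
        // Generated protobuf messages serialize themselves to a byte array
        return person == null ? null : person.toByteArray();
    }

    @Override
    public void close() {
        // nothing to release
    }
}

And the custom deserializer: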

import org.apache.kafka.common.serialization.Deserializer;

public class CustomDeserializer implements Deserializer<Person> {

    @Override
    public Person deserialize(String topic, byte[] data) {
        try {
            // Parse the raw bytes with the parser generated by protoc
            return Person.parseFrom(data);
        } catch (Exception e) {
            // ToDo: proper error handling
            return null;
        }
    }
}


Dataset<Row> dataset = sparkSession.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", topic)
        .option("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
        .option("value.deserializer", "de.myproject.CustomDeserializer")
        .load()
        .select("value");

dataset.writeStream()
        .format("console")
        .start()
        .awaitTermination();

But as output I still get the raw binary values:

-------------------------------------------
Batch: 0
-------------------------------------------
+--------------------+
|               value|
+--------------------+
|[08 AC BD BB 09 1...|
+--------------------+

-------------------------------------------
Batch: 1
-------------------------------------------
+--------------------+
|               value|
+--------------------+
|[08 82 EF D8 08 1...|
+--------------------+

According to the tutorial, I just need to set the value.deserializer option to get a human-readable format:

.option("value.deserializer", "de.myproject.CustomDeserializer")

Did I miss something?

Did you miss this section of the documentation?

Note that the following Kafka params cannot be set and the Kafka source or sink will throw an exception:

  • key.deserializer: Keys are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the keys.
  • value.deserializer: Values are always deserialized as byte arrays with ByteArrayDeserializer. Use DataFrame operations to explicitly deserialize the values.

You'll have to register a UDF that invokes your deserializer instead.

This is similar to Read protobuf kafka message using spark structured streaming.
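Putting that together, a minimal sketch of the UDF approach could look like the following. Person.parseFrom comes from the generated protobuf class in the question; the UDF name deserialize, the getName() accessor, and the topic name persons are assumptions for illustration:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.api.java.UDF1;
import org.apache.spark.sql.types.DataTypes;

public class PersonStreamConsumer {

    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("protobuf-kafka-consumer")
                .getOrCreate();

        // Parse the raw Kafka value bytes with the generated protobuf
        // parser inside a UDF; getName() is a hypothetical accessor on
        // the generated Person class.
        spark.udf().register("deserialize",
                (UDF1<byte[], String>) bytes -> Person.parseFrom(bytes).getName(),
                DataTypes.StringType);

        Dataset<Row> dataset = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "persons")   // topic name assumed
                .load()
                // value arrives as BinaryType; deserialize it explicitly
                // with the UDF instead of setting value.deserializer
                .selectExpr("deserialize(value) AS name");

        dataset.writeStream()
                .format("console")
                .start()
                .awaitTermination();
    }
}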

You need to convert the bytes to the String datatype:

dataset.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

Then you can use functions.from_json(dataset.col("value"), schema) with a matching StructType to get back the actual DataFrame, as sketched below.
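A minimal sketch of that approach, reusing the dataset from the question (note this assumes the Kafka values were JSON strings rather than protobuf binaries; for protobuf, the UDF approach above applies, and the two-field schema here is hypothetical):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Hypothetical schema for a JSON-encoded Person payload
StructType schema = new StructType()
        .add("name", DataTypes.StringType)
        .add("age", DataTypes.IntegerType);

Dataset<Row> persons = dataset
        .selectExpr("CAST(value AS STRING) AS json")          // bytes -> string
        .select(from_json(col("json"), schema).as("person"))  // string -> struct
        .select("person.*");                                  // flatten the struct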

Happy Coding :)
