
Spark structured streaming - how to queue bytes value to Kafka?

I'm writing a Spark application that uses structured streaming. The app reads messages from a Kafka topic topic1, constructs a new message, serializes it to an Array[Byte], and publishes it to another Kafka topic topic2.

Serializing to a byte array is important because I use a specific serializer/deserializer that the downstream consumer of topic2 also uses.

I'm having trouble producing to Kafka, though. I'm not even sure how to do so; the examples online mostly cover queueing JSON data.

The code -

case class OutputMessage(id: String, bytes: Array[Byte])

implicit val encoder: Encoder[OutputMessage] = org.apache.spark.sql.Encoders.kryo[OutputMessage]

val outputMessagesDataSet: Dataset[OutputMessage] = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .load()
  .select($"value")
  .mapPartitions{r =>
     val messages: Iterator[OutputMessage] = createMessages(r)
     messages
  }

outputMessagesDataSet
  .selectExpr("CAST(id AS String) AS key", "bytes AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("topic", "topic2")
  .option("checkpointLocation", loc)
  .trigger(trigger)
  .start()
  .awaitTermination()

However, that throws the exception org.apache.spark.sql.AnalysisException: cannot resolve 'id' given input columns: [value]; line 1 pos 5;

How do I queue to Kafka with id as the key and bytes as the value?

You can check the schema of the dataframe that "collects" the message. As you are collecting only the "value" field, incoming events arrive in the following form:

    +-------------------+
    | value             |
    +-------------------+
    | field1,field2,..  |
    +-------------------+
  

You need to query for the key as well, as in the Spark documentation:

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

or

df.select(col("key").cast(StringType), col("value").cast(StringType))

As @EmiCareOfCell44 suggested, I printed out the schema -

If I do outputMessagesDataSet.printSchema(), then I get only a single value of binary type. But if I do

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .load()

df.printSchema()

Then it prints

 root
  |-- key: binary (nullable = true)
  |-- value: binary (nullable = true)
  |-- topic: string (nullable = true)
  |-- partition: integer (nullable = true)
  |-- offset: long (nullable = true)
  |-- timestamp: timestamp (nullable = true)
  |-- timestampType: integer (nullable = true)

But this Dataframe hasn't yet undergone the transformation that is needed, which is done in

.mapPartitions{r =>
 val messages: Iterator[OutputMessage] = createMessages(r)
 messages
}

It looks like the Dataset's value is just a single binary column.
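The transformation inside mapPartitions can be factored so that the per-record logic is plain Scala and testable without Spark. This is only a hypothetical sketch: the id-derivation and the pass-through of raw bytes stand in for the app's real serializer/deserializer, whose details aren't shown in the question.

```scala
// Sketch only: the real app's serializer/deserializer would replace the body
// of toOutputMessage. Keeping it a pure function makes it testable without Spark.
case class OutputMessage(id: String, bytes: Array[Byte])

// Hypothetical id-derivation: take everything before the first comma of the payload.
def toOutputMessage(raw: Array[Byte]): OutputMessage = {
  val id = new String(raw, java.nio.charset.StandardCharsets.UTF_8).takeWhile(_ != ',')
  OutputMessage(id, raw)
}

// Inside mapPartitions you would first pull the binary `value` out of each Row
// (row.getAs[Array[Byte]]("value")) and then map each payload with toOutputMessage.
def createMessages(values: Iterator[Array[Byte]]): Iterator[OutputMessage] =
  values.map(toOutputMessage)
```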

I searched for some answers here and found this post - Value Type is binary after Spark Dataset mapGroups operation even return a String in the function

I had an Encoder set up -

implicit val encoder: Encoder[OutputMessage] = org.apache.spark.sql.Encoders.kryo[OutputMessage]

That Kryo encoder was causing the whole value to be serialized into a single binary column. Since OutputMessage is a Scala case class, an explicit Encoder isn't required (Spark derives a product encoder via spark.implicits), so I removed it. After that, printing the schema showed the two fields (a String and bytes, which is what I wanted), and the line .selectExpr("CAST(id AS String) AS key", "bytes AS value") worked perfectly well.
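Putting the pieces together, the corrected pipeline looks roughly like the following. This is a sketch under the question's assumptions: server and topic names are taken from the question, and createMessages, loc, and trigger are the app's own transformation, checkpoint path, and trigger. It cannot run without a live Kafka cluster, so treat it as a template rather than a complete program.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class OutputMessage(id: String, bytes: Array[Byte])

val spark = SparkSession.builder.appName("kafka-relay").getOrCreate()
import spark.implicits._ // derives a product encoder for OutputMessage; no Kryo needed

val outputMessages: Dataset[OutputMessage] = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .load()
  .select($"value")
  .mapPartitions(rows => createMessages(rows)) // no implicit kryo Encoder in scope

outputMessages
  .selectExpr("CAST(id AS String) AS key", "bytes AS value") // before writeStream
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("topic", "topic2")
  .option("checkpointLocation", loc)
  .trigger(trigger)
  .start()
  .awaitTermination()
```

The two key points are that selectExpr is applied to the Dataset before writeStream, and that no Kryo encoder is in scope, so the id and bytes columns remain visible to selectExpr.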
