
Spark structured streaming - how to queue bytes value to Kafka?

I'm writing a Spark application that uses structured streaming. The app reads messages from a Kafka topic topic1, constructs a new message, serializes it to an Array[Byte], and publishes it to another Kafka topic topic2.

Serializing to a byte array is important because I use a specific serializer/deserializer that the downstream consumer of topic2 also uses.

I'm having trouble producing to Kafka, though. I'm not even sure how to do so; the examples online mostly cover queueing JSON data.

The code -

case class OutputMessage(id: String, bytes: Array[Byte])

implicit val encoder: Encoder[OutputMessage] = org.apache.spark.sql.Encoders.kryo[OutputMessage]

val outputMessagesDataSet: Dataset[OutputMessage] = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .load()
  .select($"value")
  .mapPartitions{r =>
     val messages: Iterator[OutputMessage] = createMessages(r)
     messages
  }

outputMessagesDataSet
  .selectExpr("CAST(id AS String) AS key", "bytes AS value")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("topic", "topic2")
  .option("checkpointLocation", loc)
  .trigger(trigger)
  .start()
  .awaitTermination()

However, that throws the exception org.apache.spark.sql.AnalysisException: cannot resolve 'id' given input columns: [value]; line 1 pos 5;

How do I queue to Kafka with id as the key and bytes as the value?

You can check the schema of the dataframe that "collects" the message. As you are collecting only the "value" field, incoming events arrive in the following form:

    +-------------------+
    | value             |
    +-------------------+
    | field1,field2,..  |
    +-------------------+
  

You need to query for the key as well, as in the Spark documentation:

df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .as[(String, String)]

or

df.select(col("key").cast(StringType), col("value").cast(StringType))

As @EmiCareOfCell44 suggested, I printed out the schema -

If I do outputMessagesDataSet.printSchema(), then I get only a single value of binary type. But if I do

val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .load()

df.printSchema()

Then it prints

 root
  |-- key: binary (nullable = true)
  |-- value: binary (nullable = true)
  |-- topic: string (nullable = true)
  |-- partition: integer (nullable = true)
  |-- offset: long (nullable = true)
  |-- timestamp: timestamp (nullable = true)
  |-- timestampType: integer (nullable = true)

But this Dataframe hasn't yet undergone the transformation that is needed, which is done in

.mapPartitions{r =>
 val messages: Iterator[OutputMessage] = createMessages(r)
 messages
}

It looks like the Dataset's value is just a single binary column.
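The transformation inside mapPartitions can be factored so that the per-record logic is plain Scala and testable without Spark. This is only a hypothetical sketch: the id-derivation and the pass-through of raw bytes stand in for the app's real serializer/deserializer, whose details aren't shown in the question.

```scala
// Sketch only: the real app's serializer/deserializer would replace the body
// of toOutputMessage. Keeping it a pure function makes it testable without Spark.
case class OutputMessage(id: String, bytes: Array[Byte])

// Hypothetical id-derivation: take everything before the first comma of the payload.
def toOutputMessage(raw: Array[Byte]): OutputMessage = {
  val id = new String(raw, java.nio.charset.StandardCharsets.UTF_8).takeWhile(_ != ',')
  OutputMessage(id, raw)
}

// Inside mapPartitions you would first pull the binary `value` out of each Row
// (row.getAs[Array[Byte]]("value")) and then map each payload with toOutputMessage.
def createMessages(values: Iterator[Array[Byte]]): Iterator[OutputMessage] =
  values.map(toOutputMessage)
```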

I searched for some answers here and found this post - Value Type is binary after Spark Dataset mapGroups operation even return a String in the function

I had an Encoder set up -

implicit val encoder: Encoder[OutputMessage] = org.apache.spark.sql.Encoders.kryo[OutputMessage]

That Kryo encoder was causing the whole value to be serialized into a single binary column. Since OutputMessage is a Scala case class, an explicit Encoder isn't required (Spark derives a product encoder via spark.implicits), so I removed it. After that, printing the schema showed the two fields (a String and bytes, which is what I wanted), and the line .selectExpr("CAST(id AS String) AS key", "bytes AS value") worked perfectly well.
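Putting the pieces together, the corrected pipeline looks roughly like the following. This is a sketch under the question's assumptions: server and topic names are taken from the question, and createMessages, loc, and trigger are the app's own transformation, checkpoint path, and trigger. It cannot run without a live Kafka cluster, so treat it as a template rather than a complete program.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

case class OutputMessage(id: String, bytes: Array[Byte])

val spark = SparkSession.builder.appName("kafka-relay").getOrCreate()
import spark.implicits._ // derives a product encoder for OutputMessage; no Kryo needed

val outputMessages: Dataset[OutputMessage] = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("subscribe", "topic1")
  .load()
  .select($"value")
  .mapPartitions(rows => createMessages(rows)) // no implicit kryo Encoder in scope

outputMessages
  .selectExpr("CAST(id AS String) AS key", "bytes AS value") // before writeStream
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "server1")
  .option("topic", "topic2")
  .option("checkpointLocation", loc)
  .trigger(trigger)
  .start()
  .awaitTermination()
```

The two key points are that selectExpr is applied to the Dataset before writeStream, and that no Kryo encoder is in scope, so the id and bytes columns remain visible to selectExpr.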
