
Spark batch write to Kafka topic from multi-column DataFrame

After the batch Spark ETL, I need to write the resulting DataFrame, which contains multiple different columns, to a Kafka topic.

According to the Spark documentation (https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html), the DataFrame being written to Kafka must have the following mandatory column in its schema:

value (required): string or binary

As I mentioned, I have many more columns with values, so my question is: how do I properly send the whole DataFrame row as a single message to the Kafka topic from my Spark application? Do I need to join the values from all columns into a new DataFrame with a single value column (containing the joined values), or is there a more proper way to achieve this?

The proper way to do that is already hinted at by the docs, and it doesn't really differ from what you'd do with any Kafka client: you have to serialize the payload before sending it to Kafka.

How you'll do that (to_json, to_csv, Apache Avro) depends on your business requirements; nobody can answer this but you (or your team).
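For illustration, here is a minimal sketch in Scala of the to_json approach: every column of the row is packed into a struct, serialized to a JSON string, and exposed as the single "value" column the Kafka sink expects. The DataFrame contents, broker address (localhost:9092), and topic name ("events") are hypothetical placeholders, and the spark-sql-kafka-0-10 package must be on the classpath.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

object KafkaBatchWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaBatchWrite").getOrCreate()
    import spark.implicits._

    // Hypothetical multi-column result of the batch ETL.
    val df = Seq((1, "alice", 42.0), (2, "bob", 7.5)).toDF("id", "name", "score")

    // Pack all columns of each row into one JSON string and expose it
    // as the mandatory "value" column required by the Kafka sink.
    df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("topic", "events")                            // assumed topic name
      .save()

    spark.stop()
  }
}

The same pattern works with to_csv (Spark 3.0+) or an Avro encoder if JSON does not fit your requirements.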
