
Spark batch write to Kafka topic from multi-column DataFrame

After the batch Spark ETL, I need to write the resulting DataFrame, which contains multiple different columns, to a Kafka topic.

According to the Spark documentation (https://spark.apache.org/docs/2.2.0/structured-streaming-kafka-integration.html), the DataFrame being written to Kafka must have the following mandatory column in its schema:

value (required): string or binary

As I mentioned, I have many more columns with values, so my question is: how do I properly send the whole DataFrame row as a single message to the Kafka topic from my Spark application? Do I need to join the values from all columns into a new DataFrame with a single value column (containing the joined values), or is there a more proper way to achieve this?

The proper way to do that is already hinted at by the docs, and it doesn't really differ from what you'd do with any Kafka client: you have to serialize the payload before sending it to Kafka.

How you'll do that (to_json, to_csv, Apache Avro) depends on your business requirements; nobody can answer this but you (or your team).
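For illustration, here is a minimal sketch in Scala of the to_json approach: every column of the row is packed into a struct, serialized to a JSON string, and exposed as the single "value" column the Kafka sink expects. The DataFrame contents, broker address (localhost:9092), and topic name ("events") are hypothetical placeholders, and the spark-sql-kafka-0-10 package must be on the classpath.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

object KafkaBatchWrite {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("KafkaBatchWrite").getOrCreate()
    import spark.implicits._

    // Hypothetical multi-column result of the batch ETL.
    val df = Seq((1, "alice", 42.0), (2, "bob", 7.5)).toDF("id", "name", "score")

    // Pack all columns of each row into one JSON string and expose it
    // as the mandatory "value" column required by the Kafka sink.
    df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("topic", "events")                            // assumed topic name
      .save()

    spark.stop()
  }
}

The same pattern works with to_csv (Spark 3.0+) or an Avro encoder if JSON does not fit your requirements.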
