
How to identify the origin of messages in spark structured streaming with kafka as a source?

I have a use case in which I have to subscribe to multiple topics in Kafka in Spark Structured Streaming, then parse each message and build a Delta Lake table out of it. I have written the parser, and the messages (in XML form) are correctly parsed into a Delta Lake table. However, I am only subscribing to one topic as of now. I want to subscribe to multiple topics and, based on the topic, route each message to the parser dedicated to that particular topic. So basically I want to identify the topic name of every message as it is processed, so that I can send it to the desired parser for further processing.

This is how I am accessing the messages from the different topics. However, I have no idea how to identify the source of the incoming messages while processing them.

 val stream_dataframe = spark.readStream
  .format(ConfigSetting.getString("source"))
  .option("kafka.bootstrap.servers", ConfigSetting.getString("bootstrap_servers"))
  .option("kafka.ssl.truststore.location", ConfigSetting.getString("trustfile_location"))
  .option("kafka.ssl.truststore.password", ConfigSetting.getString("truststore_password"))
  .option("kafka.sasl.mechanism", ConfigSetting.getString("sasl_mechanism"))
  .option("kafka.security.protocol", ConfigSetting.getString("kafka_security_protocol"))
  .option("kafka.sasl.jaas.config", ConfigSetting.getString("jass_config"))
  .option("encoding", ConfigSetting.getString("encoding"))
  .option("startingOffsets", ConfigSetting.getString("starting_offset_duration"))
  .option("subscribe", ConfigSetting.getString("topics_name"))
  .option("failOnDataLoss", ConfigSetting.getString("fail_on_dataloss"))
  .load()
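For reference, the Kafka source accepts more than one topic in the `subscribe` option as a comma-separated list, or a regex via `subscribePattern`. A minimal sketch (the topic names here are hypothetical placeholders, not from my actual config):

```scala
// Subscribe to several topics at once with a comma-separated list:
.option("subscribe", "topic-a,topic-b,topic-c")
// ...or subscribe to every topic matching a pattern:
.option("subscribePattern", "topic-.*")
```

With my `ConfigSetting`-based setup, storing the comma-separated list under `topics_name` keeps the existing `subscribe` line unchanged.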


 var cast_dataframe = stream_dataframe.select(col("value").cast(StringType))

 cast_dataframe = cast_dataframe.withColumn("parsed_column", parser(col("value"))) // parser is the UDF written to parse the XML messages from the topic.

How can I identify the topic name of the messages as they are processed in Spark Structured Streaming?

As per the official documentation (emphasis mine):

Each row in the source has the following schema:

 Column      Type
 ---------   -------
 key         binary
 value       binary
 topic       string
 partition   int
 ...

As you can see, the input topic is part of the output schema and can be accessed without any special actions.
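Building on this, one way to route each row to a dedicated parser is to select the `topic` column alongside `value` and branch with `when`. A minimal sketch, assuming hypothetical topic names `orders` and `inventory` and two hypothetical per-topic UDFs (`orders_parser`, `inventory_parser`) in place of the single `parser` above:

```scala
import org.apache.spark.sql.functions.{col, when}
import org.apache.spark.sql.types.StringType

// Keep the topic column: it is part of the Kafka source schema,
// so no extra configuration is needed to read it.
val cast_dataframe = stream_dataframe.select(
  col("topic"),
  col("value").cast(StringType)
)

// Dispatch each row to the parser dedicated to its topic.
// Rows from any other topic get a null parsed_column here.
val parsed_dataframe = cast_dataframe.withColumn(
  "parsed_column",
  when(col("topic") === "orders", orders_parser(col("value")))
    .when(col("topic") === "inventory", inventory_parser(col("value")))
)
```

Alternatively, you can `filter(col("topic") === ...)` the stream into one DataFrame per topic and apply each parser separately, which keeps the per-topic logic (and the resulting Delta Lake tables) independent.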
