How to identify the origin of messages in Spark Structured Streaming with Kafka as a source?

I have a use case in which I have to subscribe to multiple Kafka topics in Spark Structured Streaming, then parse each message and build a Delta Lake table from it. I have written the parser, and the messages (in the form of XML) are parsed correctly and form the Delta Lake table. However, I am subscribing to only one topic as of now. I want to subscribe to multiple topics and, based on the topic, route each message to the parser dedicated to that particular topic. So basically I want to identify the topic name of every message as it is processed, so that I can send it to the desired parser for further processing.

This is how I am accessing the messages from the topics. However, I have no idea how to identify the source topic of the incoming messages while processing them.

 val stream_dataframe = spark.readStream
   .format(ConfigSetting.getString("source"))
   .option("kafka.bootstrap.servers", ConfigSetting.getString("bootstrap_servers"))
   .option("kafka.ssl.truststore.location", ConfigSetting.getString("trustfile_location"))
   .option("kafka.ssl.truststore.password", ConfigSetting.getString("truststore_password"))
   .option("kafka.sasl.mechanism", ConfigSetting.getString("sasl_mechanism"))
   .option("kafka.security.protocol", ConfigSetting.getString("kafka_security_protocol"))
   .option("kafka.sasl.jaas.config", ConfigSetting.getString("jass_config"))
   .option("encoding", ConfigSetting.getString("encoding"))
   .option("startingOffsets", ConfigSetting.getString("starting_offset_duration"))
   .option("subscribe", ConfigSetting.getString("topics_name"))
   .option("failOnDataLoss", ConfigSetting.getString("fail_on_dataloss"))
   .load()
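
For multiple topics, the "subscribe" option accepts a comma-separated list, so topics_name can hold several topics at once; with hypothetical topic names, the configured value would look like "topic_a,topic_b".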


 var cast_dataframe = stream_dataframe.select(col("value").cast(StringType))

 cast_dataframe = cast_dataframe.withColumn("parsed_column", parser(col("value"))) // parser is the UDF made to parse the XML from the topic.

How can I identify the topic name of the messages as they are processed in Spark Structured Streaming?

As per the official documentation (emphasis mine):

Each row in the source has the following schema:

Column      Type
key         binary
value       binary
topic       string
partition   int
...

As you can see, the input topic is part of the output schema and can be accessed without any special actions: you simply select the topic column from the streaming DataFrame.
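
A minimal sketch of what that routing could look like, assuming two hypothetical topics (topic_a, topic_b) and hypothetical UDFs parserA/parserB standing in for the dedicated per-topic XML parsers:

 import org.apache.spark.sql.functions.{col, udf, when}
 import org.apache.spark.sql.types.StringType

 // Hypothetical stand-ins for the dedicated per-topic XML parsers.
 val parserA = udf((xml: String) => s"parsed by A: $xml")
 val parserB = udf((xml: String) => s"parsed by B: $xml")

 // "topic" is part of the Kafka source schema, so select it alongside "value".
 val withTopic = stream_dataframe.select(
   col("topic"),
   col("value").cast(StringType).as("value")
 )

 // Route each message to the parser dedicated to its topic; rows matching
 // neither topic get null in "parsed_column".
 val parsed = withTopic.withColumn(
   "parsed_column",
   when(col("topic") === "topic_a", parserA(col("value")))
     .when(col("topic") === "topic_b", parserB(col("value")))
 )

Alternatively, you could filter the stream on col("topic") and run a fully separate query per topic; the column-based routing above keeps everything in a single query.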
