
How to call a method after a Spark Structured Streaming query (Kafka)?

I need to execute some functions based on the values that I receive from Kafka topics. I'm currently using a ForeachWriter to collect the values from the topic into a List. Now I want to pass this List as a parameter to the methods.

This is what I have so far:

def doA(mylist: List[String]): Unit = { /* something for A */ }
def doB(mylist: List[String]): Unit = { /* something for B */ }

And this is how I run my streaming queries:

//{"s":"a","v":"2"}
//{"s":"b","v":"3"}
val readTopics = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "myTopic").load()

val schema = new StructType()
  .add("s", StringType)
  .add("v", StringType)
      
val parseStringDF = readTopics.selectExpr("CAST(value AS STRING)")

val parseDF = parseStringDF
  .select(from_json(col("value"), schema).as("data"))
  .select("data.*")

parseDF.writeStream
  .format("console")
  .outputMode("append")
  .start()

// fails here: collect() cannot be called on a streaming Dataset
// (AnalysisException: Queries with streaming sources must be executed with writeStream.start())
val listOfTopics = parseDF.select("s").map(row => row.getString(0)).collect.toList

// unable to call the methods below
for (t <- listOfTopics) {
    if (t == "a")
        doA(listOfTopics)
    else if (t == "b")
        doB(listOfTopics)
    else
        println("do nothing")
}

spark.streams.awaitAnyTermination() 

Questions:

  1. How can I call a stand-alone (non-streaming) method in a streaming job?
  2. I cannot use ForeachWriter here because I want to pass a SparkSession to the methods, and SparkSession is not serializable. What are the alternatives for calling doA and doB in parallel?

If you want to be able to collect data back to the Spark driver, you need to use parseDF.writeStream.foreachBatch, which hands every micro-batch to a callback as a regular, bounded DataFrame (unlike a ForeachWriter, which sees one row at a time). The callback runs on the driver, so your non-serializable objects never need to be shipped to executors.
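
A minimal sketch of that approach (assuming Spark 2.4+, where foreachBatch is available, and the doA/doB stubs defined above):

import org.apache.spark.sql.DataFrame

parseDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // batchDF is a bounded, non-streaming DataFrame, so collect() is legal here
    val values = batchDF.select("s").collect().map(_.getString(0)).toList
    if (values.contains("a")) doA(values)
    if (values.contains("b")) doB(values)
  }
  .start()

The callback runs on the driver for every micro-batch, so any ordinary (non-streaming) method can be called from inside it.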

It's unclear what you need the SparkSession for within your two methods, but since they work on non-Spark data types (a plain List[String]), you probably shouldn't need a SparkSession instance there anyway.

Alternatively, you can .select() and filter on your topic column, then apply one function to the "a" rows and the other to the "b" rows, parallelizing the workload across two DataFrames, as sketched below. Otherwise, you may be better off using a plain KafkaConsumer from kafka-clients, or Kafka Streams, rather than Spark.
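
A sketch of that split, assuming the two values of the "s" column stand for the two workloads (each filtered stream becomes its own streaming query, so doA and doB run in parallel):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

parseDF.filter(col("s") === "a").writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // only "a" rows reach this query
    doA(batch.select("s").collect().map(_.getString(0)).toList)
  }
  .start()

parseDF.filter(col("s") === "b").writeStream
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // only "b" rows reach this query
    doB(batch.select("s").collect().map(_.getString(0)).toList)
  }
  .start()

Both queries run concurrently in the same SparkSession, and spark.streams.awaitAnyTermination() (as in the question) keeps the application alive.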
