How to call a method after a Spark Structured Streaming query (Kafka)?
I need to execute some functions based on the values that I receive from topics. I'm currently using ForeachWriter to convert all the topics to a List. Now, I want to pass this List as a parameter to the methods.
This is what I have so far:
def doA(mylist: List[String]): Unit = { /* something for A */ }
def doB(mylist: List[String]): Unit = { /* something for B */ }
And this is how I call my streaming queries:
//{"s":"a","v":"2"}
//{"s":"b","v":"3"}
val readTopics = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092").option("subscribe", "myTopic").load()
val schema = new StructType()
.add("s",StringType)
.add("v",StringType)
val parseStringDF = readTopics.selectExpr("CAST(value AS STRING)")
val parseDF = parseStringDF.select(from_json(col("value"), schema).as("data"))
.select("data.*")
parseDF.writeStream
.format("console")
.outputMode("append")
.start()
//fails here
val listOfTopics = parseDF.select("s").map(row => (row.getString(0))).collect.toList
//unable to call the below methods
for (t <- listOfTopics) {
  if (t == "a")
    doA(listOfTopics)
  else if (t == "b")
    doB(listOfTopics)
  else
    println("do nothing")
}
spark.streams.awaitAnyTermination()
Questions:
If you want to be able to collect data to the local Spark driver, you need to use parseDF.writeStream.foreachBatch (note: writeStream, not write), which hands you each micro-batch as a regular, non-streaming DataFrame, or alternatively a ForeachWriter. You cannot call collect directly on a streaming DataFrame, which is why your code fails at that line.
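A minimal sketch of the foreachBatch approach (assuming Spark 2.4+, and reusing the doA/doB methods and parseDF from the question; this replaces the console query, and the per-batch dispatch mirrors the original loop):

import org.apache.spark.sql.DataFrame

val query = parseDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Each micro-batch is a plain DataFrame, so collect() is legal here.
    // collect() still pulls everything to the driver; keep batches small.
    val listOfTopics = batchDF.select("s").collect().map(_.getString(0)).toList
    for (t <- listOfTopics) {
      if (t == "a") doA(listOfTopics)
      else if (t == "b") doB(listOfTopics)
      else println("do nothing")
    }
  }
  .start()

query.awaitTermination()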
It's unclear what you need the SparkSession for within your two methods, but since they work on non-Spark datatypes, you probably shouldn't be passing a SparkSession instance into them anyway.
Alternatively, you could .select() and filter on your topic column, then apply the functions to two separate "topic-a" and "topic-b" dataframes, thus parallelizing the workload; see the sketch below.
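One way to structure that, sketched under the assumption that doA and doB can each accept the values from their own filtered stream (column names as in the question; the two queries run concurrently over the same parsed stream):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Two independent streaming queries; Spark schedules them in parallel.
val queryA = parseDF.filter(col("s") === "a").writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    doA(batchDF.select("s").collect().map(_.getString(0)).toList)
  }
  .start()

val queryB = parseDF.filter(col("s") === "b").writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    doB(batchDF.select("s").collect().map(_.getString(0)).toList)
  }
  .start()

spark.streams.awaitAnyTermination()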
Otherwise, you would be better off just using a regular KafkaConsumer from kafka-clients, or Kafka Streams, rather than Spark.
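For reference, a bare-bones kafka-clients consumer loop in Scala (the broker address and topic name are taken from the question; the group id and poll interval are illustrative, and the JSON parsing is left to your library of choice):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import scala.jdk.CollectionConverters._ // Scala 2.13; use scala.collection.JavaConverters on 2.12

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "topic-dispatcher") // illustrative group id
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("myTopic"))

while (true) {
  // Each record value is the raw JSON string, e.g. {"s":"a","v":"2"};
  // parse it, then dispatch to doA/doB as in the question.
  val values = consumer.poll(Duration.ofMillis(500)).asScala.map(_.value()).toList
  // dispatch(values) // parsing and dispatch omitted here
}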