
How to call a method after a Spark Structured Streaming query (Kafka)?

I need to execute some functions based on the values that I receive from topics. I'm currently using ForeachWriter to convert all the topics to a List. Now, I want to pass this List as a parameter to the methods.

This is what I have so far:

def doA(mylist: List[String]): Unit = { /* something for A */ }
def doB(mylist: List[String]): Unit = { /* something for B */ }

And this is how I call my streaming queries:

//{"s":"a","v":"2"}
//{"s":"b","v":"3"}
val readTopics = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "myTopic")
  .load()

val schema = new StructType()
      .add("s",StringType)
      .add("v",StringType)
      
val parseStringDF = readTopics.selectExpr("CAST(value AS STRING)")

val parseDF = parseStringDF.select(from_json(col("value"), schema).as("data"))
   .select("data.*")

parseDF.writeStream
  .format("console")
  .outputMode("append")
  .start()

//fails here
val listOfTopics = parseDF.select("s").map(row => (row.getString(0))).collect.toList

//unable to call the below methods
for (t <- listOfTopics ){
    if(t == "a")
        doA(listOfTopics)
    else if (t == "b")
        doB(listOfTopics)
    else
        println("do nothing")
}

spark.streams.awaitAnyTermination() 

Questions:

  1. How can I call a stand-alone (non-streaming) method in a streaming job?
  2. I cannot use ForeachWriter here because I want to pass a SparkSession to the methods, and since SparkSession is not serializable, ForeachWriter won't work. What are the alternatives for calling the methods doA and doB in parallel?

If you want to be able to collect data to the local Spark driver, you need to use parseDF.writeStream.foreachBatch, i.e. a batch-wise alternative to a ForeachWriter.
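A minimal sketch of that approach, assuming the parseDF, doA, and doB definitions from the question. The function passed to foreachBatch runs on the driver, so plain methods (and even the SparkSession) are usable there without being serialized:

```scala
import org.apache.spark.sql.DataFrame

// Sketch only: dispatch each micro-batch on the driver.
val query = parseDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    val values = batchDF.select("s")
      .collect()            // pulls this micro-batch to the driver
      .map(_.getString(0))
      .toList
    if (values.contains("a")) doA(values)
    if (values.contains("b")) doB(values)
  }
  .outputMode("append")
  .start()
```

Note that collect() only brings back the rows of the current micro-batch, not the whole stream, so it stays bounded as long as each batch is small.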

It's unclear what you need the SparkSession for within your two methods, but since they operate on plain (non-Spark) datatypes, you probably shouldn't be passing a SparkSession instance anyway.

Alternatively, you can .select() and filter on your topic column, then apply the functions to separate "topic-a" and "topic-b" dataframes, thus parallelizing the workload. Otherwise, you would be better off just using a regular KafkaConsumer from kafka-clients, or Kafka Streams, rather than Spark.
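The split approach could be sketched like this, under the same assumptions (parseDF, doA, doB from the question). Each filtered branch gets its own sink, so the two become independent, concurrently running streaming queries; in a real deployment each query would also need its own checkpointLocation:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Sketch only: one independent streaming query per value of "s".
val aQuery = parseDF.filter(col("s") === "a").writeStream
  .foreachBatch { (df: DataFrame, _: Long) =>
    doA(df.select("s").collect().map(_.getString(0)).toList)
  }
  .start()

val bQuery = parseDF.filter(col("s") === "b").writeStream
  .foreachBatch { (df: DataFrame, _: Long) =>
    doB(df.select("s").collect().map(_.getString(0)).toList)
  }
  .start()

spark.streams.awaitAnyTermination()
```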

