
Spark Structured Streaming with foreach

I am using Spark Structured Streaming to read data from Kafka.

val readStreamDF = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", config.getString("kafka.source.brokerList"))
  .option("startingOffsets", config.getString("kafka.source.startingOffsets"))
  .option("subscribe", config.getString("kafka.source.topic"))
  .load()

Based on a uid in the message read from Kafka, I have to make an API call to an external service, fetch data, and write it back to another Kafka topic. For this I am using a custom foreach writer and processing every message.

import spark.implicits._

val eventData = readStreamDF
  .select(from_json(col("value").cast("string"), event).alias("message"), col("timestamp"))
  .withColumn("uid", col("message.eventPayload.uid"))
  .drop("message")

val q = eventData
  .writeStream
  .format("console")
  .foreach(new CustomForEachWriter())
  .start()

The CustomForEachWriter makes an API call and fetches results for the given uid from a service. The result is an array of ids. These ids are then written back to another Kafka topic via a Kafka producer.
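For reference, a minimal sketch of what such a writer might look like (the broker list, topic name, and fetchIds are hypothetical placeholders, not the actual implementation):

import java.util.Properties

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

// Sketch of a per-row writer: open/close run once per partition and epoch,
// process runs once per record.
class CustomForEachWriter extends ForeachWriter[Row] {

  @transient private var producer: KafkaProducer[String, String] = _

  override def open(partitionId: Long, epochId: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092") // placeholder broker list
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true
  }

  override def process(row: Row): Unit = {
    val uid = row.getAs[String]("uid")
    // fetchIds stands in for the blocking HTTP call to the external service
    val ids: Seq[String] = fetchIds(uid)
    ids.foreach { id =>
      producer.send(new ProducerRecord[String, String]("output-topic", uid, id))
    }
  }

  override def close(errorOrException: Throwable): Unit = {
    if (producer != null) producer.close()
  }

  // Placeholder: replace with the real API client call.
  private def fetchIds(uid: String): Seq[String] = Seq.empty
}

Creating the producer once per partition in open(), rather than per record, keeps the per-message overhead close to the API latency alone.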

There are 30 Kafka partitions and I have launched Spark with the following config:

num-executors = 30
executor-cores = 3
executor-memory = 10GB

But the Spark job still starts lagging and is not able to keep up with the incoming data rate.

The incoming data rate is around 10K messages per second. The average time to process a single message is about 100 ms.
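As a rough back-of-envelope check (a sketch assuming each core processes one row at a time and blocks on the API call):

// Capacity estimate under the stated figures.
val cores         = 30 * 3                 // num-executors * executor-cores = 90
val perMsgSeconds = 0.1                    // ~100 ms per message
val maxPerSecond  = cores / perMsgSeconds  // ≈ 900 messages/sec
val incomingRate  = 10000                  // ~10K messages/sec
// ≈900 messages/sec is far below the 10K/sec incoming rate, which matches the observed lag.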

I want to understand how Spark processes this in the case of Structured Streaming. In Structured Streaming, there is one dedicated executor which is responsible for reading data from all partitions of Kafka. Does that executor distribute tasks based on the number of partitions in Kafka? Does the data in a batch get processed sequentially? How can it be made to process in parallel so as to maximize throughput?

I think the CustomForEachWriter will work on a single row/record of the dataset. If you are using the 2.4 version of Spark, you can experiment with foreachBatch. But it is annotated as Evolving.
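A minimal sketch of what that could look like, assuming the goal is to fan out the API calls across the whole micro-batch and write the resulting ids to another Kafka topic (fetchIds, the broker list, and the topic name are placeholders, not the actual implementation):

import org.apache.spark.sql.DataFrame

// Placeholder for the external API call.
def fetchIds(uid: String): Seq[String] = Seq.empty

val q = eventData
  .writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    import batchDF.sparkSession.implicits._

    // The micro-batch is a regular DataFrame, so the API calls run as
    // ordinary tasks in parallel across all partitions/cores.
    val enriched = batchDF
      .select("uid").as[String]
      .flatMap(uid => fetchIds(uid).map(id => (uid, id)))
      .toDF("key", "value")

    // The Kafka sink expects string/binary key and value columns.
    enriched
      .write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder
      .option("topic", "output-topic")                   // placeholder
      .save()
  }
  .start()

Inside foreachBatch the data can also be repartitioned if more parallel tasks are needed for the API calls.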
