
Spark Structured Streaming - Spike in input rate decreases batch duration

I am encountering something that, at first glance for a Spark Streaming novice, seems counter-intuitive:

when Spark Structured Streaming starts processing more data, its batch duration decreases

This is probably not the most accurate picture, but I saw a much clearer pattern.

I probably need an explanation of what exactly the batch duration is - my understanding is that it represents the number of seconds it takes Spark to process the stream's mini-batch.

Next, I need clarification on how Spark triggers the processing of a mini-batch - whether it is based on the amount of data in the batch or on time intervals...

EDIT
The code follows. There are quite a lot of "heavy" operations (joins, dropDuplicates, filtering with higher-order functions, UDFs, ...). Both the sink and the source are Azure Event Hubs.

# [IMPORTS]
from datetime import datetime
from pyspark.sql.functions import (broadcast, col, current_timestamp, expr,
                                   from_json, struct, to_json, udf)
from pyspark.sql.types import TimestampType

# [CONFIGS]
ehConfig = {
  'eventhubs.startingPosition': '{"offset": "@latest", "enqueuedTime": null, "isInclusive": true, "seqNo": -1}',
  'eventhubs.maxEventsPerTrigger': 300,
  'eventhubs.connectionString': 'XXX'}

ehOutputConfig = {
  'eventhubs.connectionString': 'YYY',
  "checkpointLocation": "azure_blob_storage/ABCABC"
}

spark.conf.set("spark.sql.shuffle.partitions", 3)

# [FUNCS]
@udf(TimestampType())
def udf_current_timestamp():
  return datetime.now()

#-----------#
# STREAMING # 
#-----------#

# [STREAM INPUT]
df_stream_input = spark.readStream.format("eventhubs").options(**ehConfig).load()

# [ASSEMBLY THE DATAFRAME]
df_joined = (df_stream_input
             .withColumn("InputProcessingStarted", current_timestamp().cast("long"))

             # Decode body
             .withColumn("body_decoded", from_json(col("body").cast("string"), schema=_config))

             # Join customer 
             .join(df_batch, ['CUSTOMER_ID'], 'inner')

             # Filtering
             .filter(expr('body_decoded.status NOT IN (0, 4, 32)'))
             .filter(expr('EXISTS(body_decoded.items, item -> item.ID IN (1, 2, 7))'))

             # Deduplication
             .withWatermark('enqueuedTime', '1 day') 
             .dropDuplicates(['CUSTOMER_ID', 'ItemID']) 

             # Join with lookup table
             .join(broadcast(df_lookup), ['OrderType'], 'left') 

             # UDF
             .withColumn('AssembleTimestamp', udf_current_timestamp())

             # Assemble struct 
             .withColumn('body_struct', struct('OrderType', 'OrderID', 'Price', 'StockPile'))
             )

# [STREAM OUTPUT]
(df_joined
 .select(to_json('body_struct').alias('body'))
 .writeStream
 .format("eventhubs")
 .options(**ehOutputConfig)
 .trigger(processingTime='2 seconds')
 .start())

In Spark Structured Streaming, a new batch is triggered as soon as the previous batch has finished processing, unless you specify a trigger option.
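For illustration, here is a minimal sketch of the trigger options exposed by the PySpark DataStreamWriter; the DataFrame df and the console sink are placeholders, not taken from the question's code.

# Default: the next micro-batch starts as soon as the previous one finishes.
query_default = df.writeStream.format("console").start()

# Fixed interval: a micro-batch is started every 10 seconds; if a batch takes
# longer than the interval, the next one starts as soon as the slow batch finishes.
query_interval = (df.writeStream
                  .format("console")
                  .trigger(processingTime="10 seconds")
                  .start())

# One-shot: process everything currently available in one batch, then stop.
query_once = (df.writeStream
              .format("console")
              .trigger(once=True)
              .start())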

In earlier versions of Spark with Spark Streaming (the DStream API), we could specify a batch duration of, say, 5 seconds. In that case, it would trigger a micro-batch every 5 seconds and process the data that arrived in the last 5 seconds. In the case of Kafka, it would fetch the data that has not yet been committed.
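For contrast, here is a minimal sketch of that legacy DStream API, where the batch duration is fixed when the StreamingContext is created; the socket source and its host/port are placeholders used only to make the example self-contained.

from pyspark.streaming import StreamingContext

# A micro-batch is triggered every 5 seconds with whatever data arrived in that window.
ssc = StreamingContext(spark.sparkContext, batchDuration=5)
lines = ssc.socketTextStream("localhost", 9999)
lines.count().pprint()  # prints one count per 5-second batch

ssc.start()
ssc.awaitTermination()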
