
Spark Structured Streaming: Does processing load affect Input Rate/numInputRecords?

My current Structured Streaming application writes to a huge Delta table. When I stop the stream and point it at a brand-new Delta table:

  1. It becomes much faster: batch duration drops to roughly a quarter of what it was.
  2. The input rate increases almost 3x.

I understand why it might become faster: the aggregations/writes it was doing against the older/bigger table are not needed on the new table. But the input rate change is something I am hoping someone can explain.

The source is Azure EventHubs.
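For reference, a minimal sketch of the setup described above (not the asker's actual code), assuming the com.microsoft.azure:azure-eventhubs-spark connector is on the classpath; the connection string and paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("eventhubs-to-delta").getOrCreate()

    # The connector expects the connection string encrypted via EventHubsUtils
    # (documented pattern for PySpark); "<namespace>" etc. are placeholders.
    jvm = spark.sparkContext._jvm
    eh_conf = {
        "eventhubs.connectionString": jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(
            "Endpoint=sb://<namespace>.servicebus.windows.net/;"
            "SharedAccessKeyName=<key-name>;SharedAccessKey=<key>;EntityPath=<hub>"
        ),
    }

    stream = (
        spark.readStream
             .format("eventhubs")
             .options(**eh_conf)
             .load()
    )

    query = (
        stream.writeStream
              .format("delta")
              .option("checkpointLocation", "/mnt/checkpoints/my-stream")  # placeholder path
              .start("/mnt/delta/target-table")  # repointing this at a fresh path is the experiment above
    )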

Thanks!

Answering my own question:

The logic behind Input Rate and Processing Rate appears to be the following:

Input Rate      = numInputRows (batch size) / trigger interval in seconds
Processing Rate = numInputRows (batch size) / batch duration in seconds

Without an explicit trigger interval, the two should be the same, because batch duration = trigger interval.
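These numbers can be read off a running query via StreamingQuery.lastProgress (a dict in PySpark, parsed from the StreamingQueryProgress JSON). A minimal sketch, assuming `query` is the handle returned by `writeStream.start()`:

    from pyspark.sql.streaming import StreamingQuery

    def print_rates(query: StreamingQuery) -> None:
        """Print the metrics behind Input Rate / Processing Rate for the last batch."""
        p = query.lastProgress
        if p is None:
            return  # no batch has completed yet
        batch_duration_s = p["durationMs"]["triggerExecution"] / 1000.0
        print("numInputRows           :", p["numInputRows"])
        print("inputRowsPerSecond     :", p["inputRowsPerSecond"])      # the UI's Input Rate
        print("processedRowsPerSecond :", p["processedRowsPerSecond"])  # the UI's Processing Rate
        # Processing Rate should come out to roughly numInputRows / batch duration:
        print("numInputRows / duration:", p["numInputRows"] / batch_duration_s)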

So with a bigger table with lots of partitions, the writes and aggregates take longer, which increases the batch duration and thereby decreases the Input Rate (and Processing Rate). The opposite holds for the smaller target table, which explains its faster input/processing rates.
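Purely illustrative arithmetic with hypothetical numbers, showing how a shorter batch duration alone raises both reported rates when no trigger interval is set:

    # Hypothetical batch: same number of rows, processed against a large
    # Delta table vs. a fresh one; with no explicit trigger interval,
    # batch duration acts as the effective trigger interval.
    num_input_rows = 120_000

    batch_durations_s = {
        "big table": 48.0,  # slow writes/aggregations on the big table
        "new table": 12.0,  # ~1/4 the duration on the fresh table
    }

    for label, duration in batch_durations_s.items():
        rate = num_input_rows / duration
        print(f"{label}: input/processing rate = {rate:,.0f} rows/sec")

    # big table: 2,500 rows/sec; new table: 10,000 rows/sec -> the shorter
    # batch duration alone accounts for the higher reported input rate.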


 