Spark Structured Streaming: Does Processing load affect Input Rate/numInputRecords?
My current Structured Streaming application writes to a huge Delta table. When I stop the stream and point it to write to a brand new Delta table, the job runs faster and the reported input rate changes.
I understand it might become faster, since any aggregations/writes it performs against the older, bigger table are not needed on the new table. But can someone explain the change in input rate?
Source is Azure EventHubs.
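For context, the setup looks roughly like the sketch below. This is a minimal, hypothetical example, not the actual job: the connection string, paths and checkpoint locations are placeholders, and it assumes the azure-eventhubs-spark connector and Delta Lake are available. Switching to the new table only changes the sink path and checkpoint; the source and query logic stay the same.

```python
# Minimal sketch of an Event Hubs -> Delta streaming job (placeholders throughout).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("eventhubs-to-delta").getOrCreate()

conn_str = "<Event Hubs connection string>"  # placeholder
eh_conf = {
    # The azure-eventhubs-spark connector expects the connection string encrypted.
    "eventhubs.connectionString":
        spark.sparkContext._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(conn_str),
}

stream = (
    spark.readStream
    .format("eventhubs")
    .options(**eh_conf)
    .load()
)

# Writing to the huge table vs. the brand new table differs only in the
# target path (and checkpoint location).
query = (
    stream.writeStream
    .format("delta")
    .option("checkpointLocation", "/checkpoints/new-table")  # placeholder path
    .outputMode("append")
    .start("/delta/new-table")  # point this at the old or the new table path
)
```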
Thanks!
Answering my own question:
The logic behind Input Rate and Processing Rate seems to be the following:
Input Rate = numInputRows (or batch size) / trigger interval in seconds
Processing Rate = numInputRows (or batch size) / batch duration in seconds
Without an explicit trigger interval, they should be the same, because the next batch starts as soon as the previous one finishes, so the batch duration effectively equals the trigger interval.
So with a bigger table that has lots of partitions, the writes and aggregates take longer, which increases the batch duration and thereby decreases the Input Rate (and Processing Rate). That also explains the opposite case: smaller target tables show faster input/processing rates.
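For anyone who wants to verify this, the numbers behind both rates are visible on StreamingQueryProgress. A minimal sketch (assuming `query` is the handle returned by `writeStream...start()`, and treating `durationMs.triggerExecution` as the batch duration) is:

```python
# Inspect the metrics the UI derives Input Rate and Processing Rate from.
progress = query.lastProgress  # dict form of the latest StreamingQueryProgress
if progress:
    num_rows = progress["numInputRows"]
    # Wall-clock time of the whole micro-batch, in milliseconds.
    batch_duration_s = max(progress["durationMs"]["triggerExecution"], 1) / 1000.0

    print("numInputRows:                   ", num_rows)
    print("reported inputRowsPerSecond:    ", progress["inputRowsPerSecond"])
    print("reported processedRowsPerSecond:", progress["processedRowsPerSecond"])
    # Processing Rate ~= numInputRows / batch duration; a slow sink (big Delta
    # table) stretches the batch duration and pulls both rates down.
    print("numInputRows / batchDuration:   ", num_rows / batch_duration_s)
```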